Preface



Graphs are among the most important abstract data structures in computer sci- 
ence, and the algorithms that operate on them are critical to modern life. Graphs 
have been shown to be powerful tools for modeling complex problems because of 
their simplicity and generality. For this reason, the field of graph algorithms has 
become one of the pillars of theoretical computer science, informing research in 
such diverse areas as combinatorial optimization, complexity theory, and topology. 
Graph algorithms have been adapted and implemented by the military and com- 
mercial industry, as well as by researchers in academia, and have become essential 
in controlling the power grid, telephone systems, and, of course, computer networks. 

The increasing preponderance of computer and other networks in the past 
decades has been accompanied by an increase in the complexity of these networks 
and the demand for efficient and robust graph algorithms to govern them. To 
improve the computational performance of graph algorithms, researchers have pro- 
posed a shift to a parallel computing paradigm. Indeed, the use of parallel graph 
algorithms to analyze and facilitate the operations of computer and other networks 
is emerging as a new subdiscipline within the applied mathematics community.

The combination of these two relatively mature disciplines — graph algorithms 
and parallel computing — has been fruitful, but significant challenges still remain. 
In particular, the tasks of implementing parallel graph algorithms and achieving 
good parallel performance have proven especially difficult. 

In this monograph, we address these challenges by exploiting the well-known 
duality between the canonical representation of graphs as abstract collections of 
vertices with edges and a sparse adjacency matrix representation. In so doing, we 
show how to leverage existing parallel matrix computation techniques as well as 
the large amount of software infrastructure that exists for these computations to 
implement efficient and scalable parallel graph algorithms. In addition, and perhaps 
more importantly, a linear algebraic approach allows the large pool of researchers 
trained in fields other than computer science, but who have a strong linear algebra 
background, to quickly understand and apply graph algorithms. 

Our treatment of this subject is intended formally to complement the large 
body of literature that has already been written on graph algorithms. Nevertheless, 
the reader will find several benefits to the approaches described in this book. 

(1) Syntactic complexity. Many graph algorithms are more compact and are 
easier to understand when presented in a sparse matrix linear algebraic format. 
An algorithmic description that assumes a sparse matrix representation of the 
graph, and operates on that matrix with linear algebraic operations, can be readily 
understood without the use of additional data structures and can be translated into
a program directly using any of a number of array-based programming environments 
(e.g., Matlab®). 

(2) Ease of implementation. Parallel graph algorithms are notoriously difficult 
to implement. By describing graph algorithms as procedures of linear algebraic 
operations on sparse (adjacency) matrices, all the existing software infrastructure 
for parallel computations on sparse matrices can be used to produce parallel and 
scalable programs for graph problems. Moreover, much of the emerging Partitioned 
Global Address Space (PGAS) libraries and languages can also be brought to bear 
on the parallel computation of graph algorithms. 

(3) Performance. Graph algorithms expressed by a series of sparse matrix 
operations have clear data-access patterns and can be optimized more easily. Not 
only can the memory access patterns be optimized for a procedure written as a 
series of matrix operations, but a PGAS library could exploit this transparency by 
ordering global communication patterns to hide data-access latencies. 

This work represents the first of its kind on this interesting topic of linear 
algebraic graph algorithms, and represents a collection of original work on the topic 
that has historically been scattered across the literature. This is an edited volume 
and each chapter is self-contained and can be read independently. However, the 
authors and editors have taken great care to unify their notation and terminology 
to present a coherent work on this topic. 

The book is divided into three parts: (I) Algorithms, (II) Data, and (III) Com- 
putation. Part I presents the basic mathematical framework for expressing common 
graph algorithms using linear algebra. Part II provides a number of examples where 
a linear algebraic approach is used to develop new algorithms for modeling and an- 
alyzing graphs. Part III focuses on the sparse matrix computations that underlie a 
linear algebraic approach to graph algorithms. The book concludes with a discus- 
sion of some outstanding questions in the area of large graphs. 

While most algorithms are presented in the form of pseudocode, when working
code examples are required, these are expressed in Matlab, and so a familiarity
with Matlab is helpful, but not required.

This book is suitable as the primary book for a class on linear algebraic graph 
algorithms. This book is also suitable as either the primary or supplemental book 
for a class on graph algorithms for engineers and scientists outside of the field of 
computer science. Wherever possible, the examples are drawn from widely known
and well-documented algorithms that have already been identified as representing 
many applications (although the connection to any particular application may re- 
quire examining the references). 

Finally, in recognition of the severe time constraints of professional users, 
each chapter is mostly self-contained and key terms are redefined as needed. Each 
chapter has a short summary and references within that chapter are listed at the 
end of the chapter. This arrangement allows the professional user to pick up and 
use any particular chapter as needed. 
Chapter 1

Graphs and Matrices 



Jeremy Kepner* 



Abstract 

A linear algebraic approach to graph algorithms that exploits the sparse 
adjacency matrix representation of graphs can provide a variety of ben- 
efits. These benefits include syntactic simplicity, easier implementation, 
and higher performance. Selected examples are presented illustrating 
these benefits. These examples are drawn from the remainder of the 
book in the areas of algorithms, data analysis, and computation. 

1.1 Motivation

The duality between the canonical representation of graphs as abstract collections of 
vertices and edges and a sparse adjacency matrix representation has been a part of 
graph theory since its inception [König 1931, König 1936]. Matrix algebra has been
recognized as a useful tool in graph theory for nearly as long (see [Harary 1969] and
the references therein, in particular [Sabadusi 1960, Weischel 1962, McAndrew 1963,
Teh & Yap 1964, McAndrew 1965, Harary & Trauth 1964, Brualdi 1967]). How- 
ever, matrices have not traditionally been used for practical computing with graphs, 
in part because a dense 2D array is not an efficient representation of a sparse graph. 
With the growth of efficient data structures and algorithms for sparse arrays and
matrices, it has become possible to develop a practical array-based approach to
computation on large sparse graphs.

*MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420 (kepner@ll.mit.edu).

This work is sponsored by the Department of the Air Force under Air Force Contract
FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are
those of the author and are not necessarily endorsed by the United States Government.

[Figure 1.1 panels: graph G = (V,E); adjacency matrix A^T; vector x; product A^T x]

Figure 1.1. Matrix graph duality.

Adjacency matrix A is dual with the corresponding graph. In addition,
vector matrix multiply is dual with breadth-first search.

There are several benefits to a linear algebraic approach to graph algorithms. 
These include: 

1. Syntactic complexity. Many graph algorithms are more compact and are easier 
to understand in an array-based representation. In addition, these algorithms 
are accessible to a new community not historically trained in canonical graph 
algorithms. 

2. Ease of implementation. Array-based graph algorithms can exploit the exist- 
ing software infrastructure for parallel computations on sparse matrices. 

3. Performance. Array-based graph algorithms more clearly highlight the data- 
access patterns and can be readily optimized. 

The rest of this chapter will give a brief survey of some of the more interesting 
results to be found in the rest of this book, with the hope of motivating the reader 
to further explore this interesting topic. These results are divided into three parts: 
(I) Algorithms, (II) Data, and (III) Computation. 

1.2 Algorithms 

Linear algebraic approaches to fundamental graph algorithms have a variety of 
interesting properties. These include the basic graph/adjacency matrix duality,
correspondence with semiring operations, and extensions to tensors for representing 
multiple-edge graphs. 

1.2.1 Graph adjacency matrix duality 

The fundamental concept in an array-based graph algorithm is the duality between 
a graph and its adjacency representation (see Figure 1.1). To review, for a graph 
G = (V, E) with N vertices and M edges, the N x N adjacency matrix A has the
property A(i,j) = 1 if there is an edge e_ij from vertex v_i to vertex v_j and is zero
otherwise.

Perhaps even more important is the duality that exists with the fundamental
operation of linear algebra (vector matrix multiply) and a breadth-first search (BFS)
step performed on G from a starting vertex s

    $\mathrm{BFS}(G, s) \Leftrightarrow A^T v, \quad v(s) = 1$

This duality allows graph algorithms to be simply recast as a sequence of linear 
algebraic operations. Many additional relations exist between fundamental linear 
algebraic operations and fundamental graph operations (see chapters in Part I). 
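
The duality can be illustrated with a minimal Python sketch, given here only as an illustration and not as the book's implementation. It assumes a dense list-of-lists adjacency matrix with 0-based vertex labels and evaluates A^T v over the boolean semiring; the function name bfs_step and the example graph are hypothetical.

```python
def bfs_step(A, v):
    """One BFS step as A^T v over the boolean semiring.

    A[i][j] == 1 means there is an edge from vertex i to vertex j.
    v[i] == 1 marks the vertices in the current frontier.
    """
    n = len(A)
    return [int(any(A[i][j] and v[i] for i in range(n))) for j in range(n)]


# Hypothetical example graph: edges 0->1, 0->3, 1->2.
A = [
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
v = [1, 0, 0, 0]                     # v(s) = 1 for the start vertex s = 0
frontier1 = bfs_step(A, v)           # vertices one hop from s
frontier2 = bfs_step(A, frontier1)   # vertices two hops from s
print(frontier1, frontier2)
```

Each call advances the search by one level; a full BFS would also mask out vertices that have already been visited.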

1.2.2 Graph algorithms as semirings 

One way to employ linear algebra techniques for graph algorithms is to use a broader
definition of matrix and vector multiplication. One such broader definition is that
of a semiring (see Chapter 2). In this context, the basic multiply operation becomes
(in Matlab notation)

    A op1.op2 v

where for a traditional matrix multiply op1 = + and op2 = * (i.e., Av = A +.* v).
Using such notation, canonical graph algorithms such as the Bellman-Ford shortest
path algorithm can be rewritten using the following semiring vector matrix product
(see Chapters 3 and 5)

    d = d min.+ A

where the N x 1 vector d holds the length of the shortest path from a given starting
vertex s to all the other vertices.
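
As a rough sketch of how this product can be iterated, the Python fragment below repeats the (min, +) vector matrix product N - 1 times. It assumes a dense weight matrix with infinity for missing edges and zeros on the diagonal, so that a distance already found is retained from one pass to the next. The names min_plus_vecmat and bellman_ford and the example weights are hypothetical.

```python
INF = float("inf")


def min_plus_vecmat(d, A):
    """Vector matrix product over (min, +): entry j is min_i (d[i] + A[i][j])."""
    n = len(d)
    return [min(d[i] + A[i][j] for i in range(n)) for j in range(n)]


def bellman_ford(A, s):
    """Shortest path lengths from s, assuming no negative cycles."""
    n = len(A)
    d = [INF] * n
    d[s] = 0
    for _ in range(n - 1):          # at most n - 1 relaxation passes are needed
        d = min_plus_vecmat(d, A)
    return d


# A[i][j] is the weight of edge i -> j, INF if absent, 0 on the diagonal.
A = [
    [0,   4,   1,   INF],
    [INF, 0,   INF, 1],
    [INF, 2,   0,   5],
    [INF, INF, INF, 0],
]
dist = bellman_ford(A, 0)
print(dist)
```

The zero diagonal is a design choice of this sketch: it lets each pass be written as a single product rather than as a product followed by an elementwise minimum with the previous d.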

More complex algorithms, such as betweenness centrality (see Chapter 6), can 
also be effectively represented using this notation. In short, betweenness centrality 
tries to measure the "importance" of a vertex in a graph by determining how many 
shortest paths the vertex is on and normalizing by the number of paths through the 
vertex. In this instance, we see that the algorithm effectively reduces to a variety 
of matrix matrix and matrix vector multiplies. 

Another example is subgraph detection (see Chapter 8), which reduces to a 
series of "selection" operations 

    Row selection:       diag(u) A

    Col selection:       A diag(v)

    Row/Col selection:   diag(u) A diag(v)

where diag(v) is a diagonal matrix with the values of the vector v along the diagonal.
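
The effect of these products can be checked with a small Python sketch, assuming dense list-of-lists matrices and 0/1 indicator vectors. Left-multiplying by diag(u) zeroes every row i with u(i) = 0, and right-multiplying by diag(v) zeroes every column j with v(j) = 0. The helper name select is hypothetical.

```python
def select(A, u=None, v=None):
    """Compute diag(u) A diag(v); a missing u or v is treated as all ones."""
    rows, cols = len(A), len(A[0])
    u = [1] * rows if u is None else u
    v = [1] * cols if v is None else v
    return [[u[i] * A[i][j] * v[j] for j in range(cols)] for i in range(rows)]


A = [[1, 2],
     [3, 4]]
rows_kept = select(A, u=[1, 0])           # keeps row 0 only
cols_kept = select(A, v=[0, 1])           # keeps column 1 only
both = select(A, u=[1, 0], v=[0, 1])      # keeps entry (0, 1) only
print(rows_kept, cols_kept, both)
```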
[Figure 1.2 plot; horizontal axis: Vertex Degree]

Figure 1.2. Power law graph. 

Real and simulated in-degree distribution for the Epinions data set. 

1.2.3 Tensors 

In many domains (e.g., network traffic analysis), it is common to have multiple edges 
between vertices. Matrix notation can be extended to these graphs using tensors 
(see Chapter 7). For example, consider a graph with at most Nk edges between any 
two vertices. This graph can be represented using the N x N x Nk tensor X where 
X(i, j, k) is the kth edge going from vertex i to vertex j.

1.3 Data 

A matrix-based approach to the analysis of real-world graphs is useful for the sim- 
ulation and theoretical analysis of these data sets. 

1.3.1 Simulating power law graphs

Power law graphs are ubiquitous and arise in the Internet, the web, citation graphs, 
and online social networks. Power law graphs have the general property that the 
histograms of their degree distribution Deg() fall off with a power law and are
approximately linear in a log-log plot (see Figure 1.2). Mathematically, this
observation can be stated as

    Slope[log(Count[Deg(g)])] ≈ -constant
Efficiently generating simulated data sets that satisfy this property is difficult.
Interestingly, an array-based approach using Kronecker products naturally produces
graphs of this type (see Chapters 9 and 10). The Kronecker product graph
generation algorithm can be described as follows. First, let
$A : \mathbb{R}^{M_B M_C \times N_B N_C}$, $B : \mathbb{R}^{M_B \times N_B}$, and
$C : \mathbb{R}^{M_C \times N_C}$. Then the Kronecker product is defined as follows:

$$
A = B \otimes C =
\begin{pmatrix}
b_{1,1} C   & b_{1,2} C   & \cdots & b_{1,N_B} C   \\
b_{2,1} C   & b_{2,2} C   & \cdots & b_{2,N_B} C   \\
\vdots      & \vdots      &        & \vdots        \\
b_{M_B,1} C & b_{M_B,2} C & \cdots & b_{M_B,N_B} C
\end{pmatrix}
$$

Now let $G : \mathbb{R}^{N \times N}$ be an adjacency matrix. The Kronecker exponent to the power
k is as follows

$$
G^{\otimes k} = \underbrace{G \otimes G \otimes \cdots \otimes G}_{k \text{ factors}}
$$

which generates an $N^k \times N^k$ adjacency matrix.
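
A minimal Python sketch of the Kronecker product and Kronecker power follows, assuming dense list-of-lists matrices and 0-based indices. The function names kron and kron_power and the seed matrix G are hypothetical, and a practical graph generator would use sparse storage instead.

```python
def kron(B, C):
    """Kronecker product of B (mb x nb) and C (mc x nc) as an (mb*mc) x (nb*nc) matrix."""
    mb, nb = len(B), len(B[0])
    mc, nc = len(C), len(C[0])
    return [
        [B[i // mc][j // nc] * C[i % mc][j % nc] for j in range(nb * nc)]
        for i in range(mb * mc)
    ]


def kron_power(G, k):
    """k-fold Kronecker product G (x) G (x) ... (x) G, for k >= 1."""
    result = G
    for _ in range(k - 1):
        result = kron(result, G)
    return result


G = [[1, 1],
     [0, 1]]
G2 = kron_power(G, 2)   # 4 x 4
G3 = kron_power(G, 3)   # 8 x 8
for row in G2:
    print(row)
```

Block (s, t) of the result is B(s, t) times a full copy of C, which is why the row and column indices split into a quotient (the block) and a remainder (the position inside the block).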



1.3.2 Kronecker theory 

It would be very useful if it were possible to analytically compute various centrality 
metrics for power law graphs. This is possible (see Chapter 10), for example, for 
Kronecker graphs of the form 

    $(B(n,m) + I)^{\otimes k}$

where I is the identity matrix and B(n,m) is the adjacency matrix of a complete
bipartite graph with sets of n and m vertices. For example, the degree distribution
(i.e., the histogram of the degree centrality) of the above Kronecker graph is

    $\mathrm{Count}[\mathrm{Deg} = (n + 1)^r (m + 1)^{k-r}] =$

for r = 0, ..., k.
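
The degree histogram of such a graph can be explored numerically for small cases. The Python sketch below builds (B(n,m) + I) raised to the Kronecker power k explicitly and tallies the row sums. The closed-form count it compares against, binom(k, r) m^r n^(k-r), is this sketch's own assumption: it is derived from the fact that row sums multiply under the Kronecker product, with the n-vertex set having degree m + 1 and the m-vertex set having degree n + 1, and it should be checked against Chapter 10. All function names are hypothetical.

```python
from collections import Counter
from math import comb


def bipartite_plus_identity(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m) plus the identity."""
    size = n + m
    return [
        [1 if (i == j or (i < n) != (j < n)) else 0 for j in range(size)]
        for i in range(size)
    ]


def kron(B, C):
    """Dense Kronecker product of two list-of-lists matrices."""
    mc, nc = len(C), len(C[0])
    return [
        [B[i // mc][j // nc] * C[i % mc][j % nc] for j in range(len(B[0]) * nc)]
        for i in range(len(B) * mc)
    ]


def degree_histogram(n, m, k):
    """Histogram {degree: count} of the row sums of (B(n,m) + I) to Kronecker power k."""
    G = bipartite_plus_identity(n, m)
    K = G
    for _ in range(k - 1):
        K = kron(K, G)
    return Counter(sum(row) for row in K)


n, m, k = 1, 3, 2
measured = degree_histogram(n, m, k)
# Assumed closed form (valid when n != m so that the degrees are distinct).
predicted = {
    (n + 1) ** r * (m + 1) ** (k - r): comb(k, r) * m ** r * n ** (k - r)
    for r in range(k + 1)
}
print(dict(measured), predicted)
```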




1.4 Computation

The previous sections have given some interesting examples of the uses of array- 
based graph algorithms. In many cases, these algorithms reduce to various sparse 
matrix multiply operations. Thus, the effectiveness of these algorithms depends 
upon the ability to efficiently run such operations on parallel computers. 

1.4.1 Graph analysis metrics 

Centrality analysis is an important tool for understanding real-world graphs. Cen- 
trality analysis deals with the identification of critical vertices and edges (see Chap- 
ter 12). Example centrality metrics include 
Degree centrality is the in-degree or out-degree of the vertex. In an array
formulation, this is simply the sum of a row or a column of the adjacency 
matrix. 

Closeness centrality measures how close a vertex is to all the vertices. For 
example, one commonly used measure is the reciprocal of the sum of all the 
shortest path lengths. 

Stress centrality computes how many shortest paths the vertex is on. 

Betweenness centrality computes how many shortest paths the vertex is on 
and normalizes this value by the number of shortest paths to a given vertex. 

Many of these metrics are computationally intensive and require parallel implemen- 
tations to compute them on even modest-sized graphs (see Chapter 12). 

1.4.2 Sparse matrix storage

An array-based approach to graph algorithms depends upon efficient handling of
sparse adjacency matrices (see Chapter 13). The primary goal of a sparse matrix 
is efficient storage that is a small multiple of the number of nonzero elements in 
the matrix M. A standard storage format used in many sparse matrix software 
packages is the Compressed Storage by Columns (CSC) format (see Figure 1.3). 
The CSC format is essentially a dense collection of sparse column vectors. Likewise, 
the Compressed Storage by Rows (CSR) format is essentially a dense collection of 
sparse row vectors. Finally, a less commonly used format is the "tuples" format,
which is simply a collection of row, column, and value 3-tuples of the nonzero
elements. Mathematically, the following notation can be used to differentiate these
different formats

    $A : \mathbb{R}^{S(N) \times N}$    sparse rows (CSR)
    $A : \mathbb{R}^{N \times S(N)}$    sparse columns (CSC)
    $A : \mathbb{R}^{S(N \times N)}$    sparse rows and columns (tuples)

    Matrix              Compressed Storage by Columns

    31   0  53          value:     31  41  59  26  53
     0  59   0          row:        1   3   2   3   1
    41  26   0          colstart:   1   3   5   6

Figure 1.3. Sparse matrix storage.

The CSC format consists of three arrays: colstart, row, and value.
colstart is an N-element vector that holds a pointer into row, which
holds the row index of each nonzero value in the columns.
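
The three CSC arrays can be reproduced for the 3 x 3 matrix of Figure 1.3 with a short Python sketch. It assumes a dense list-of-lists input and 1-based indices to match the figure; the function name to_csc is hypothetical. This version appends one final entry to colstart marking the end of the last column, which is why four entries appear for three columns.

```python
def to_csc(dense):
    """Convert a dense matrix to CSC arrays (value, row, colstart), 1-based."""
    nrows, ncols = len(dense), len(dense[0])
    value, row, colstart = [], [], [1]
    for j in range(ncols):                 # walk the matrix column by column
        for i in range(nrows):
            if dense[i][j] != 0:
                value.append(dense[i][j])
                row.append(i + 1)          # 1-based row index
        colstart.append(len(value) + 1)    # where the next column would start
    return value, row, colstart


matrix = [
    [31, 0, 53],
    [0, 59, 0],
    [41, 26, 0],
]
value, row, colstart = to_csc(matrix)
print("value:   ", value)
print("row:     ", row)
print("colstart:", colstart)
```

The nonzeros of column j occupy positions colstart[j] through colstart[j+1] - 1 of value and row, so the storage is proportional to the number of nonzeros plus the number of columns.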



1.4.3 Sparse matrix multiply 

In addition to efficient sparse matrix storage, array-based algorithms depend upon 
an efficient sparse matrix multiply operation (see Chapter 14). Independent of the 
underlying storage representation, the amount of useful computation done when two 
random N x N matrices with M nonzeros are multiplied together is approximately 
$2M^2/N$. By using this model, it is possible to quickly estimate the computational
complexity of many linear algebraic graph algorithms. A more detailed model of
the useful work in multiplying two specific sparse matrices A and B is

$$
\mathrm{flops}(A \cdot B) = 2 \sum_{k=1}^{N} \mathrm{nnz}(A(:,k)) \cdot \mathrm{nnz}(B(k,:))
$$

where M = nnz() is the number of nonzero elements in the matrix. Sparse matrix
matrix multiply is a natural primitive operation for graph algorithms but has not 
been widely studied by the numerical sparse matrix community. 
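
The two work models can be compared on a small case with the Python sketch below, which assumes dense list-of-lists inputs purely for readability. The function name spgemm_flops is hypothetical. When every column of A and every row of B holds exactly M/N nonzeros, the detailed sum reduces to 2 N (M/N)^2 = 2M^2/N, which the permutation-matrix example illustrates.

```python
def spgemm_flops(A, B):
    """Useful work in A * B: 2 * sum_k nnz(A(:, k)) * nnz(B(k, :))."""
    inner = len(B)
    col_nnz = [sum(1 for i in range(len(A)) if A[i][k] != 0) for k in range(inner)]
    row_nnz = [sum(1 for x in B[k] if x != 0) for k in range(inner)]
    return 2 * sum(c * r for c, r in zip(col_nnz, row_nnz))


# Small irregular example.
A = [[1, 0],
     [1, 1]]
B = [[1, 1],
     [0, 1]]
flops_small = spgemm_flops(A, B)

# Permutation matrix: N = 4, M = 4, one nonzero per row and per column.
P = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
N, M = 4, 4
flops_perm = spgemm_flops(P, P)
print(flops_small, flops_perm, 2 * M**2 // N)
```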



1.4.4 Parallel programming

Partitioned Global Address Space (PGAS) languages and libraries are the natural 
environment for implementing array-based algorithms. PGAS approaches have been 
implemented in C, Fortran, C++, and Matlab (see Chapter 4 and [Kepner 2009]).
The essence of PGAS is the ability to specify how an array is decomposed on a 
parallel processor. This decomposition is usually specified in a structure called a 
"map" (or layout, distributor, distribution, etc.). Some typical maps are shown in 
Figure 1.4. 

The usefulness of PGAS can be illustrated in the following Matlab example, 
which creates two distributed arrays A and B and then performs a data redistribution 
via the assignment operation 



    Amap = map([Np 1],{},0:Np-1);   % Row map.
    Bmap = map([1 Np],{},0:Np-1);   % Column map.
    A = rand(N,N,Amap);             % Distributed array.
    B = zeros(N,N,Bmap);            % Distributed array.
    B(:,:) = A;                     % Redistribute A to B.
Figure 1.4. Parallel maps.

A selection of maps that are typically supported in PGAS programming 
environments. 

Mathematically, we can write the same algorithm as follows

    $A : \mathbb{R}^{P(N) \times N}$
    $B : \mathbb{R}^{N \times P(N)}$
    $B = A$

where P() is used to denote the dimension of the array that is being distributed
across multiple processors.



1.4.5 Parallel matrix multiply performance

The PGAS notation allows array algorithms to be quickly transformed into graph 
algorithms. The performance of such algorithms can then be derived from the 
performance of parallel sparse matrix multiply (see Chapter 14), which can be 
written as 

    $A, B, C : \mathbb{R}^{P(N \times N)}$
    $A = BC$

The computation and communication times of such an algorithm for random sparse 
matrices are 

    $T_{comp}(N_P) \propto (M/N) M / N_P$

[Figure 1.5 plot; horizontal axis: Number of processors]

Figure 1.5. Sparse parallel performance. 

Triangles and squares show the measured performance of a parallel be- 
tweenness centrality code on two different computers. Dashed lines 
show the performance predicted from the parallel sparse matrix mul- 
tiply model showing the implementation achieved near the theoretical 
maximum performance the computer hardware can deliver. 

The resulting performance speedup (see Figure 1.5) on a typical parallel computing 
architecture then shows the characteristic scaling behavior empirically observed (see 
Chapter 11). 

Finally, it is worth mentioning that the above performance is for random 
sparse matrices. However, the adjacency matrices of power law graphs are far from 
random, and the parallel performance is dominated by the large load imbalance that 
occurs because certain processors hold many more nonzero values than others. This 
has been a historically difficult problem to address in parallel graph algorithms. 
Fortunately, array-based algorithms combined with PGAS provide a mechanism 
to address this issue by remapping the matrix. One such remapping is the two- 
dimensional cyclic distribution that is commonly used to address load balancing in 
parallel linear algebra. Using Pc() to denote this distribution, we have the following 
algorithm 

    $A, B, C : \mathbb{R}^{P_c(N \times N)}$
    $A = BC$

Thus, with a very minor algorithmic change, P() → Pc(), the distribution of nonzero
values can be made more uniform across processors.
Better distributions for sparse matrices can be discovered using automated
parallel mapping techniques (see Chapter 15) that exploit the specific distribution
of nonzeros in a sparse matrix.

1.5 Summary

This chapter has given a brief survey of some of the more interesting results to be 
found in the rest of this book, with the hope of motivating the reader to further 
explore this fertile area of graph algorithms. The book concludes with a final chapter 
discussing some of the outstanding issues in this field as it relates to the analysis of 
large graph problems. 
Chapter 2

Linear Algebraic Notation 
and Definitions 



Eric Robinson*, Jeremy Kepner*, and John Gilbert†



Abstract 

This chapter presents notation, definitions, and conventions for graphs, 
matrices, arrays, and operations upon them. 



2.1 Graph notation 

For the most part, this book will not distinguish between a graph and its adjacency 
matrix and will move freely between vertex/edge notation and matrix notation. 
Thus, a graph G can be written either as G = (V, E), where V is a set of N vertices
and E is a set of M edges (directed edges unless otherwise stated), or as G = A,
where A is an N x N matrix with M nonzeros, namely A(i,j) = 1 whenever (i,j)
is an edge. This representation will allow many standard graph algorithms to be
expressed in a concise linear algebraic form. 
Usually N will be the number of vertices and M the number of edges in a
graph. There are several other equivalent notations

                                          Adjacency Matrix   Vertex/Edge
    Parameter                  Variable   Notation           Notation

    number of vertices         N          |A|                |V|
    number of directed edges   M          nnz(A)             |E|

2.2 Array notation 

Most of the arrays (including vectors, matrices, and tensors) in this book have 
elements that are either boolean (from B), integer (from Z), or real (from R). The
notation $A : \mathbb{R}^{5 \times 6 \times 7}$, for example, indicates that A is a 3D array of 210 real numbers,
of size 5 by 6 by 7.

Scalars, vectors, matrices, and tensors are considered arrays; we use the fol- 
lowing typographical conventions for them. 

    Dimensions   Name     Typeface             Example

    0            scalar   italic lowercase     s
    1            vector   boldface lowercase   v
    2            matrix   boldface capital     M
    3 or more    tensor   boldface script      T



The ith entry of a vector v is denoted by v(i). An individual entry of a matrix M
or a three-dimensional tensor T is denoted by M(i,j) or T(i,j,k). We also allow
indexing on expressions; for example, [(I - A)^{-1}](i,j) is an entry of the inverse of
the matrix I - A.

We will often use the Matlab notation for subsections and indexes of arrays
with any number of dimensions. For example, A(1:5, [3 1 4 1]) is a 5 x 4 array
containing the elements in the first five rows of columns 3, 1, 4, and 1 (again) in
that order. If I is an index or a set of row indices, then A(I, :) is the submatrix of
A with those rows and all columns.

2.3 Algebraic notation 

Here we describe the common algebraic structures and operations on arrays that are 
used throughout the book. Some individual chapters also define notation specific 
to their topics; for example, Chapter 7 introduces a number of additional types of
matrix and tensor multiplication. 

2.3.1 Semirings and related structures 

A semiring is a set of elements with two binary operations, sometimes called "ad- 
dition" and "multiplication," such that 

• Addition and multiplication have identity elements, sometimes called 0 and 
1, respectively. 
• Addition and multiplication are associative.

• Addition is commutative.

• Multiplication distributes over addition from both left and right.

• The additive identity is a multiplicative annihilator, 0 * a = a * 0 = 0.

Both R and Z are semirings under their usual addition and multiplication
operations. The booleans B are a semiring under ∧ and ∨, as well as under ∨ and ∧. If
R and Z are augmented with +∞, they become semirings with min for "addition"
and + for "multiplication." Linear algebra on this (min, +) semiring is often useful
for solving various types of shortest path problems.

We often write semiring addition and multiplication using ordinary notation 
as a + b and a * b or just ab. When this could be ambiguous or confusing, we
sometimes make the semiring operations explicit. 

Most of matrix arithmetic and much of linear algebra can be done in the 
context of a general semiring. Both more and less general settings are sometimes 
useful. We will see examples that formulate graph algorithms in terms of matrix 
vector and matrix matrix multiplication over structures that are semiring-like except 
that addition is not commutative. We will also see algorithms that require a semiring 
to be closed, which means that the equation x = 1 + ax has a solution for every
a. Roughly speaking, this corresponds to saying that the sequence 1 + a + a^2 + ...
converges to a limit.

2.3.2 Scalar operations 

Scalar operations like a + b and ab have the usual interpretation. An operation 
between a scalar and an array is applied pointwise; thus a + M is a matrix the same 
size as M. 

2.3.3 Vector operations 

We depart from the convention of numerical linear algebra by making no distinction 
between row and column vectors. (In the context of multidimensional tensors, we 
prefer not to deal with notation for a different kind of vector in each dimension.) 
For vectors $v : \mathbb{R}^M$ and $w : \mathbb{R}^N$, the outer product of v and w is written as v ∘ w,
which is the M x N matrix whose (i, j) element is v(i) * w(j). If M = N, the inner
product v · w is the scalar $\sum_i v(i) * w(i)$.

Given also a matrix $M : \mathbb{R}^{M \times N}$, the products vM and Mw are both vectors,
of dimension N and M, respectively.

When we operate over semirings other than the usual (+, *) rings on R and
Z, we will sometimes make the semiring operations explicit in matrix vector (and
matrix matrix) multiplication. For example, M(min.+)w, or M min.+ w, is the
M-vector whose ith element is min(M(i,j) + w(j) : 1 ≤ j ≤ N). The usual matrix
vector multiplication Mw could also be written as M +.* w.
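
One way to make the semiring explicit in code is to pass the two operations as parameters. The Python sketch below assumes dense list-of-lists matrices and ordinary Python callables for the "addition" and "multiplication" operations; the function name semiring_matvec is hypothetical.

```python
import operator
from functools import reduce


def semiring_matvec(M, w, add, mul):
    """Compute M add.mul w: entry i is the add-reduction over j of mul(M[i][j], w[j])."""
    return [
        reduce(add, (mul(M[i][j], w[j]) for j in range(len(w))))
        for i in range(len(M))
    ]


M = [[1, 2],
     [3, 4]]
w = [10, 20]

plus_times = semiring_matvec(M, w, operator.add, operator.mul)   # M +.* w
min_plus = semiring_matvec(M, w, min, operator.add)              # M min.+ w
print(plus_times, min_plus)
```

Swapping the pair of operations changes the algorithm that the same loop structure computes, which is the point of the dotted notation.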
2.3.4 Matrix operations

Three kinds of matrix "multiplication" arise frequently in graph algorithms. All 
three are defined over any semiring. 

If A and B are matrices of the same size, the pointwise product (or Hadamard
product) A .* B is the matrix C with C(i,j) = A(i,j) * B(i,j). Similar notation
applies to other pointwise binary operators; for example, C = A ./ B has C(i,j) =
A(i,j)/B(i,j), and A .+ B is the same as A + B.

If A is M x N and B is N x P, then AB is the conventional M x P matrix
product. We sometimes make the semiring explicit by writing, for example, A +.* B
or A min.+ B.

Finally, if A is M x N and B is P x Q, the Kronecker product A ⊗ B is the
MP x NQ matrix C with C(i,j) = A(s,t) * B(u,v), where i = (s - 1)P + u and
j = (t - 1)Q + v. One can think of A ⊗ B as being obtained by replacing each
element A(s,t) of A by its pointwise product with a complete copy of B. The
Kronecker power $A^{\otimes k}$ is defined as the k-fold Kronecker product A ⊗ A ⊗ ... ⊗ A.

It is useful to extend the "dotted" notation to represent matrix scalings. If A is
an M x N matrix, v is an M-vector, and w is an N-vector, then v .* A scales the rows
of A; that is, the result is the matrix whose (i, j) entry is v(i) * A(i,j). Similarly,
A .* w scales columns, yielding the matrix whose (i, j) entry is w(j) * A(i,j). In
Matlab notation, these could be written diag(v) * A and A * diag(w).

2.4 Array storage and decomposition 

Section 2.2 defined multidimensional arrays as mathematical objects, without ref- 
erence to how they are stored in a computer. When presenting algorithms, we 
sometimes need to talk about the representation used for storage. This section 
gives our notation for describing sparse and distributed array storage. 

2.4.1 Sparse 

An array whose elements are mostly zeros can be represented compactly by storing 
only the nonzero elements and their indices. Many different sparse data structures 
exist; Chapter 13 surveys several of them. 

It is often useful to view sparsity as an attribute attached to one or more
dimensions of an array. For example, the notation $A : \mathbb{R}^{500 \times S(600)}$ indicates that A
is a 500 x 600 array of real numbers, which can be thought of as a dense array of 500
rows, each of which is a sparse array of 600 columns. Figure 2.1 shows two possible
data structures for an array $A : \mathbb{Z}^{4 \times S(4)}$. A data structure for $A : \mathbb{R}^{S(500) \times 600}$
would interchange the roles of rows and columns. An array $A : \mathbb{R}^{S(500) \times S(600)}$, or
equivalently $A : \mathbb{R}^{S(500 \times 600)}$, is sparse in both dimensions; it might be represented
simply as an unordered sequence of triples (i,j,a) giving the positions and values
of the nonzero elements. A three-dimensional array $A : \mathbb{R}^{500 \times 600 \times S(700)}$ is a dense
two-dimensional array of 500 x 600 sparse 700-vectors.

Sparse representations generally trade off ease of access for memory. Most data
structures support constant-time random indexing along dense dimensions, but not
along sparse dimensions. The memory requirement is typically proportional to the
number of nonzeros times the number of sparse dimensions, plus the product of the
sizes of the dense dimensions.

Figure 2.1. Sparse data structures.

Two data structures for a sparse array $A : \mathbb{Z}^{4 \times S(4)}$. Left: adjacency
lists. Right: compressed sparse rows.

2.4.2 Parallel 

When analyzing the parallel performance of the algorithms described in this book,
it is important to consider three things: the number of instances of the program
used in the computation (denoted $N_P$), the unique identifier of each instance of the
program (denoted $P_{ID} = 0, \ldots, N_P - 1$), and the distribution of the arrays used in
those algorithms over those $P_{ID}$s. Consider a nondistributed array $A : \mathbb{R}^{N \times N}$. The
corresponding distributed array is given in "P notation" as $A : \mathbb{R}^{P(N) \times N}$, where
the first dimension is distributed among $N_P$ program instances. Figure 2.2 shows
$A : \mathbb{R}^{P(16) \times 16}$ for $N_P = 4$. Likewise, Figure 2.3 shows $A : \mathbb{R}^{16 \times P(16)}$ for $N_P = 4$.

Block distribution 

A block distribution is the default distribution. It is used to represent the grouping
of adjacent columns/rows, planes, or hyperplanes on the same $P_{ID}$. A parallel
dimension is declared using $P(N)$ or $P_b(N)$. For $A : \mathbb{R}^{P(N) \times N}$,
each row $A(i, :)$ is assumed to reside on
$P_{ID} = \lfloor i / \lceil N/N_P \rceil \rfloor$. Some examples of block distributions
for matrices are provided. Figure 2.2 shows a block distribution over the rows of a
matrix. Figure 2.3 shows a block distribution over the columns of a matrix.

Cyclic distribution 

A cyclic distribution is used to represent distributing adjacent items in a distributed
dimension onto different $P_{ID}$s. For $A : \mathbb{R}^{P_c(N) \times N}$, each row
$A(i, :)$ is assumed to reside on $P_{ID} = (i - 1) \bmod N_P$.

Some examples of cyclic distributions for matrices are provided. Figure 2.4 
shows a cyclic distribution over the rows of a matrix. Figure 2.5 shows a cyclic 
distribution over the columns of a matrix. 
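The two ownership rules can be compared side by side with a small Python sketch. It assumes 0-based row indices, so the cyclic rule becomes `i mod N_P` and the block rule divides the 0-based index by the block size; the function names are hypothetical.

```python
from math import ceil


def block_owner(i, n, n_p):
    """Owner of 0-based row i when n rows are split into contiguous blocks."""
    return i // ceil(n / n_p)


def cyclic_owner(i, n_p):
    """Owner of 0-based row i when rows are dealt out round-robin."""
    return i % n_p


# A 16-row dimension distributed over 4 program instances, as in the figures.
n, n_p = 16, 4
block = [block_owner(i, n, n_p) for i in range(n)]
cyclic = [cyclic_owner(i, n_p) for i in range(n)]
```

Under the block rule each instance holds four adjacent rows, while under the cyclic rule adjacent rows always land on different instances.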
Figure 2.4. A row cyclic matrix. Figure 2.5. A column cyclic matrix.

Chapter 3

Connected Components 
and Minimum Paths 



Charles M. Rader* 



Abstract 

A familiarity with matrix algebra is useful in understanding and in- 
venting graph algorithms. In this chapter, two very different examples 
of graph algorithms based on linear algebra are presented. Strongly 
connected components are obtained via efficient computation of infinite 
powers of the adjacency matrix. Shortest paths are computed using a 
modification of matrix exponentiation. 

3.1 Introduction 

Any graph can be represented by its adjacency matrix A. Familiarity with matrix
operations and properties could therefore be a useful asset for solving some problems
in graph theory. In this chapter, we consider two of the classical graph theory
problems, finding strongly connected components and finding minimum path lengths,
from the point of view of linear algebra.

To find strongly connected components, we will rely on a relationship in linear
algebra

$$(\mathbf{I} - \mathbf{A})^{-1} = \sum_{n=0}^{\infty} \mathbf{A}^n$$

which is the matrix analogy to the series identity $1/(1 - x) = 1 + x + x^2 + \cdots$.

*MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420
(charlesmrader@verizon.net).

This work is sponsored by the Department of the Air Force under Air Force Contract FA8721- 
05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author 
and are not necessarily endorsed by the United States Government. 
Figure 3.1. Strongly connected components.

An eight-node graph illustrating strongly connected components. In the
adjacency matrix of the graph row/col 1 is node a, row/col 2 is node b,
etc.

To compute minimum path lengths, we will need an invented operation that
looks like matrix exponentiation, except that the two operations + and × are
replaced by min and +, respectively. A well-known shortcut algorithm for computing
a power of a matrix will still work even after making that change.

3.2 Strongly connected components 

For any directed graph, it is possible to group the nodes of the graph into maximal 
sets such that within each set of nodes there exist paths connecting any node to any 
other node in the same set. For example, for the eight-node graph in Figure 3.1, 
those sets are nodes a,b,e, nodes c,d, nodes f,g, and node h. Segregating the nodes 
of a graph into such sets is called finding strongly connected components. 

The graph in Figure 3.1 is taken from [Cormen et al. 2001], page 553.* We
will use it as an example in this chapter.

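For our example graph, the incidence matrix is

$$\mathbf{A} = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$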

Identifying strongly connected components can be accomplished using matrix 
operations. 

*T.H. Cormen, C.E. Leiserson, R. Rivest and C. Stein. Introduction to Algorithms, third
edition, figure, page 553, © 2009 Massachusetts Institute of Technology, by permission of the MIT
Press.
Let's define another matrix C computed by using the element-wise or function
($\vee$) of $\mathbf{I}, \mathbf{A}, \mathbf{A}^2, \mathbf{A}^3, \ldots$, where
$a \vee b$ is 0 when both a and b are 0, and is 1 if either a or b is nonzero.
The matrix C can be made up of an infinite number of terms
(although a large enough finite number of terms would do). For the moment, let's
not worry about how many terms we use

$$\mathbf{C} = \mathbf{I} \vee \mathbf{A} \vee \mathbf{A}^2 \vee \mathbf{A}^3 \vee \mathbf{A}^4 \vee \cdots$$

$\mathbf{A}^k$ has the property that $\mathbf{A}^k(i,j) > 0$ if and only if there is
a path from node i to node j in exactly k steps. If there are m ways to go from
node i to node j in exactly k steps, then $\mathbf{A}^k(i,j) = m$.
$\mathbf{A}^k(i,j) = 0$ when there is no such path with exactly k steps.

The resulting matrix C has a nonzero in position i, j if and only if there are
paths from node i to node j in any number of steps, including 0 steps, because we
included I in the series.

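Here's what C looks like in our example

$$\mathbf{C} = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$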

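Now consider the element-by-element logical and function ($\wedge$) of C and
$\mathbf{C}^T$

$$\mathbf{C} \wedge \mathbf{C}^T = \begin{bmatrix}
1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

This is essentially the answer we seek. Row i of $\mathbf{C} \wedge \mathbf{C}^T$
has a 1 in column k if and only if node i and node k belong to the same strongly
connected set of nodes.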
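The construction of C and of its element-wise and with its transpose can be sketched in a few lines of Python. The edge list below is an assumption of this sketch (it is meant to match the example graph, with nodes a through h); the function names are hypothetical, and dense lists are used only to keep the example short.

```python
NODES = "abcdefgh"
# Assumed edge list for the eight-node example graph ("ab" means a -> b).
EDGES = ["ab", "bc", "be", "bf", "cd", "cg", "dc", "dh",
         "ea", "ef", "fg", "gf", "gh", "hh"]


def adjacency(nodes, edges):
    idx = {v: k for k, v in enumerate(nodes)}
    a = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        a[idx[u]][idx[v]] = 1
    return a


def bool_product(x, y):
    """Nonzero pattern of the matrix product x * y."""
    n = len(x)
    return [[int(any(x[i][k] and y[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]


def reachability(a):
    """C = I v A v A^2 v ...; powers up to n - 1 suffice for n nodes."""
    n = len(a)
    c = [[int(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in c]                 # pattern of A^0 = I
    for _ in range(n - 1):
        power = bool_product(power, a)            # pattern of the next power
        c = [[c[i][j] | power[i][j] for j in range(n)] for i in range(n)]
    return c


def strong_components(nodes, a):
    c = reachability(a)
    n = len(a)
    same = [[c[i][j] & c[j][i] for j in range(n)] for i in range(n)]  # C ^ C^T
    groups = {}                                   # identical rows = one component
    for i, row in enumerate(same):
        groups.setdefault(tuple(row), []).append(nodes[i])
    return c, same, list(groups.values())


C, SAME, COMPONENTS = strong_components(NODES, adjacency(NODES, EDGES))
```

Grouping nodes by identical rows of the final matrix recovers the four sets named earlier: a, b, e; c, d; f, g; and h.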

3.2.1 Nondirected links 

If the links in the graph are all bidirectional, then A is a symmetric matrix.
Therefore, C is symmetric. Hence $\mathbf{C} \wedge \mathbf{C}^T$ is the same as C
and that step can be omitted.
3.2.2 Computing C quickly

Computing $\mathbf{C} = \mathbf{I} \vee \mathbf{A} \vee \mathbf{A}^2 \vee \mathbf{A}^3 \vee \mathbf{A}^4 \vee \cdots$
looks complicated. Suppose instead we compute

$$\mathbf{D} = \mathbf{I} + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \mathbf{A}^4 + \cdots + \mathbf{A}^K$$

Then D has 0 where C has 0, and D has a nonzero positive entry where C has 1.
So $\mathbf{C}(i,j) = (\mathbf{D}(i,j) > 0)$. We can compute D and we instantly
have C. Maybe D is easier to compute.

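Now let D be taken for an infinite number of terms. This series almost certainly
does not converge, but we will fix that later. For now consider
$\mathbf{E} = (\mathbf{I} - \mathbf{A})\mathbf{D}$

$$\begin{aligned}
\mathbf{E} &= (\mathbf{I} - \mathbf{A})\mathbf{D} = \mathbf{D} - \mathbf{A}\mathbf{D} \\
&= (\mathbf{I} + \mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \mathbf{A}^4 + \cdots)
 - (\mathbf{A} + \mathbf{A}^2 + \mathbf{A}^3 + \mathbf{A}^4 + \cdots) \\
&= \mathbf{I}
\end{aligned}$$

Hence $\mathbf{D} = (\mathbf{I} - \mathbf{A})^{-1}$. This is the compact
representation of D, which we should be able to compute quickly using sparse
matrix algorithms.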

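Unfortunately, the method will almost always fail. The problem is that the
series for D usually does not converge. But we can fix our convergence problem
very easily. Let's introduce $\alpha$, which is any positive number, replace A by
$\alpha\mathbf{A}$, and redefine D

$$\mathbf{D} = \mathbf{I} + (\alpha\mathbf{A}) + (\alpha\mathbf{A})^2 + (\alpha\mathbf{A})^3 + (\alpha\mathbf{A})^4 + \cdots$$

taken for an infinite number of terms. Once again, we argue that the term
$(\alpha\mathbf{A})^k$ has a nonzero positive entry in position i, j if and only if
there is a path from node i to node j in k steps. So D has a nonzero positive entry
in position i, j if and only if there is a path from node i to node j.
Now we examine $\mathbf{E} = (\mathbf{I} - \alpha\mathbf{A})\mathbf{D}$

$$\begin{aligned}
\mathbf{E} &= (\mathbf{I} - \alpha\mathbf{A})\mathbf{D} \\
&= (\mathbf{I} + \alpha\mathbf{A} + \alpha^2\mathbf{A}^2 + \alpha^3\mathbf{A}^3 + \alpha^4\mathbf{A}^4 + \cdots)
 - (\alpha\mathbf{A} + \alpha^2\mathbf{A}^2 + \alpha^3\mathbf{A}^3 + \alpha^4\mathbf{A}^4 + \cdots) \\
&= \mathbf{I}
\end{aligned}$$

Hence $\mathbf{D} = (\mathbf{I} - \alpha\mathbf{A})^{-1}$.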
If we choose $\alpha$ sufficiently small (smaller than the reciprocal of the spectral
radius of A), we are sure that all our infinite series converge, so all our math is
valid. In the illustrated case, we used $\alpha = 0.5$ and computed
$\mathbf{D} = (\mathbf{I} - \alpha\mathbf{A})^{-1}$; the resulting D has the same
pattern of 0 and nonzero elements as C.
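The claim that the inverse has the same nonzero pattern as C can be checked numerically. The Python sketch below is only an illustration: the edge list is an assumption intended to match the example graph, the dense Gauss-Jordan routine stands in for a sparse solver, and exact rational arithmetic is used so that the test for zero is reliable.

```python
from fractions import Fraction


def inverse(m):
    """Gauss-Jordan inverse over exact rationals (m must be nonsingular)."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


NODES = "abcdefgh"
# Assumed edge list for the eight-node example graph ("ab" means a -> b).
EDGES = ["ab", "bc", "be", "bf", "cd", "cg", "dc", "dh",
         "ea", "ef", "fg", "gf", "gh", "hh"]
n = len(NODES)
A = [[0] * n for _ in range(n)]
for u, v in EDGES:
    A[NODES.index(u)][NODES.index(v)] = 1

alpha = Fraction(1, 2)
M = [[Fraction(int(i == j)) - alpha * A[i][j] for j in range(n)] for i in range(n)]
D = inverse(M)                                   # D = (I - alpha*A)^(-1)
pattern = [[int(x > 0) for x in row] for row in D]

# Reference pattern: reflexive-transitive closure by Warshall's algorithm.
C = [[int(i == j or A[i][j]) for j in range(n)] for i in range(n)]
for k in range(n):
    for i in range(n):
        for j in range(n):
            C[i][j] |= C[i][k] & C[k][j]
```

For this graph every cycle has spectral radius 1, so $\alpha = 0.5$ is small enough for the series to converge, and `pattern` agrees with `C` entry by entry.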

3.3 Dynamic programming, minimum paths, and matrix
exponentiation

In this section, we will formulate dynamic programming [Bellman 1952], a method 
of finding the least cost paths through a graph, by using incidence matrices, and we 
will define a matrix operation which is isomorphic to matrix multiplication. Then 
we will extend that to an isomorphism to matrix exponentiation and show how it 
finds all the shortest paths in a graph. 

Suppose we have a directed graph, G, with N nodes and M edges, and an
extended incidence matrix A (whose i, j entry is the cost, nonnegative, of going
from node i to node j of the graph G in one step). We are interested in finding
another matrix whose i, j entry is the least possible cost of going from node i to
node j in whatever number of steps gives the least cost. The entries A(i, i) on the
diagonal of matrix A will be 0, as we will explain later.

When we multiply a matrix A by another matrix B, component i, j of the
resulting matrix-matrix product C is

$$\mathbf{C}(i,j) = \sum_k \mathbf{A}(i,k)\,\mathbf{B}(k,j)$$

Let's define a different way of matrices operating on one another

$$\mathbf{C} = \mathbf{A}\ \text{min.+}\ \mathbf{B}$$

$$\mathbf{C}(i,j) = \min_k \{\, \mathbf{A}(i,k) + \mathbf{B}(k,j) \,\} \qquad (3.1)$$

The min.+ operation is isomorphic to matrix-matrix multiplication. The roles
played by scalar multiplication and scalar addition in a conventional matrix-matrix
multiply are played by addition and minimum in the min.+ operation

$$\sum_k \rightarrow \min_k ; \qquad \times \rightarrow +$$

Next we will apply the min.+ operation to finding the costs of shortest paths in G.
When the graph G has no edge from node i to node j, the incidence matrix
A would normally have A(i, j) = 0. But that would mean that there was no cost
to go from one node to another node not even connected to it. We would rather
have missing edges left out of the minimization defined in equation (3.1). Instead
of formally changing equation (3.1), it is easiest to use a modified incidence matrix
for which a left-out edge has $\mathbf{A}(i,j) = \infty$. Fortunately, it is not
necessary to have a computer code representing $\infty$. Any sufficiently large
number, larger than the cost of any $(N - 1)$-step path, would serve as well.

Let's consider $\mathbf{C} = \mathbf{A}\ \text{min.+}\ \mathbf{A}$

$$\mathbf{C}(i,j) = \min_k \{\, \mathbf{A}(i,k) + \mathbf{A}(k,j) \,\}$$

A(i, k) is the weight along a path from node i to node k and A(k, j) is the
weight along a path from node k to node j. Then A(i, k) + A(k, j) is the sum of the
weights (hereafter called the cost) encountered along the two-step path from node
i to node j passing through node k, and
$\mathbf{C}(i,j) = \min_k \{ \mathbf{A}(i,k) + \mathbf{A}(k,j) \}$ is the
cost along the two-step path from node i to node j for which that cost is minimum.

It follows that C will have in position i, j the lowest cost of moving from
node i to node j in exactly two steps. But recall that earlier we decided that A(i, i)
would be zero. So the minimization in
$\mathbf{C}(i,j) = \min_k \{ \mathbf{A}(i,k) + \mathbf{A}(k,j) \}$ includes
k = i and k = j, and these represent the cost of going from i to j in only one step.
Therefore, $\mathbf{C} = \mathbf{A}\ \text{min.+}\ \mathbf{A}$ will have in position
i, j the lowest cost of moving from node i to node j in no more than two steps. We
could consider some path as using two steps even when one of the steps goes from
node i to itself. In the remainder of this discussion, when we refer to paths of a
given length, that will always include all shorter paths as well.
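A minimal Python sketch of the min.+ operation of equation (3.1) follows. The four-node cost matrix is a made-up example, not one from the text; it has zeros on the diagonal and uses floating-point infinity for missing edges.

```python
INF = float("inf")


def min_plus(a, b):
    """C(i,j) = min over k of a[i][k] + b[k][j], as in equation (3.1)."""
    return [[min(a[i][k] + b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical cost matrix: 0 on the diagonal, INF where there is no edge.
A = [[0,   3,   INF, 7],
     [INF, 0,   1,   INF],
     [INF, INF, 0,   2],
     [1,   INF, INF, 0]]

A2 = min_plus(A, A)   # least costs using no more than two steps
```

In this example `A2[0][2]` is 4 (the two-step route through node 1), `A2[0][3]` stays at 7 because no route of at most two steps beats the direct edge, and `A2[1][0]` remains infinite because node 0 cannot be reached from node 1 in two steps.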

Now we introduce the notation $\mathbf{A}^{\wedge n}$. We will define

$$\mathbf{A}^{\wedge 1} = \mathbf{A}, \qquad \mathbf{A}^{\wedge 2} = \mathbf{A}\ \text{min.+}\ \mathbf{A}$$

and in general

$$\mathbf{A}^{\wedge n} = \mathbf{A}\ \text{min.+}\ \mathbf{A}^{\wedge (n-1)}$$

A(i, j) has the minimum cost of all one-step paths from node i to node j
because there is only one such path. We have the beginning of a pattern.
$\mathbf{A}^{\wedge 1}$ gives the minimum costs for one step, and
$\mathbf{A}^{\wedge 2}$ gives the minimum costs for two steps.
We will next show that the pattern continues, e.g., $\mathbf{A}^{\wedge 3}$ gives
the minimum costs for three steps, etc.

We will later show that $\mathbf{A}^{\wedge n}$ shares an important property with
matrix exponentiation; namely, we will prove that for any nonnegative p and q,
$\mathbf{A}^{\wedge (p+q)} = \mathbf{A}^{\wedge p}\ \text{min.+}\ \mathbf{A}^{\wedge q}$.

First, we simply explain why the i, j entry of the matrix $\mathbf{A}^{\wedge n}$
is the cost of the least cost path from node i to node j using n steps. We do this
by mathematical induction. We assume that the statement is true for n and show that
it must then be true for n + 1. We already know that the statement is true for
n = 1 since that is how A was defined.

The smallest one-step path cost from node i to any node k is given in A,
and the smallest n step path cost from any node k to node j is given in
$\mathbf{B} = \mathbf{A}^{\wedge n}$. But there must be some node k on the smallest
cost n + 1 step path from node i to any node j. Bellman's principle tells us that
all subpaths on an optimum path must be optimum subpaths. So if node k is on the
optimum path from i to j, the one step subpath from i to k must be (trivially)
optimum and its cost is A(i, k), and the subpath of n steps from k to j must be
optimum and its cost is B(k, j). The computation
$\mathbf{A}\ \text{min.+}\ \mathbf{B}$ explicitly determines which node k optimizes
A(i, k) + B(k, j), which is the cost of a path with n + 1 steps. Q.E.D.

3.3.1 Matrix powers 

We could compute $\mathbf{B} = \mathbf{A}^{\wedge K}$ by the following loop

    B = A
    for j = 2 : K
    do
        B = A min.+ B

As we have seen, B(i, j) would contain the cost of the least cost K step path from
node i to node j. But suppose we are interested in the absolute least cost. It
would never require more than N − 1 steps to get from one node to another at least
cost. Therefore, $\mathbf{A}^{\wedge n}$ does not change as n increases beyond
n = N − 1. [Note: it is assumed that it is possible to get from node i to node j.
If a graph consists of several disconnected subgraphs, then there are nodes i, j
for which the absolute least cost path, with at most N − 1 steps, would have to
include one impossible step, e.g., it would have to include one step with infinite
costs.]

$\mathbf{B} = \mathbf{A}^{\wedge K}$ contains the costs of least cost paths, but
does not contain the paths themselves. The information about paths was thrown away
when, in equation (3.1), we saved only the minimum A(i, k) + B(k, j) and did not
record the k which achieved it. If we want to identify the optimum paths, whenever
we perform a min.+ calculation $\mathbf{B} = \mathbf{A}\ \text{min.+}\ \mathbf{B}$,
we simply maintain a second matrix D whose i, j element is the k for which the
minimum was achieved. There may sometimes be ties, but if a tie is arbitrarily
broken, then one of the minimum cost paths will be identified.

Now we return to prove the claim that for any nonnegative p and q,
$\mathbf{A}^{\wedge (p+q)} = \mathbf{A}^{\wedge p}\ \text{min.+}\ \mathbf{A}^{\wedge q}$.
But this is simply another statement of Bellman's principle. For any k,
$\mathbf{R} = \mathbf{A}^{\wedge p}$ has the optimum p-step path cost R(i, k)
between node i and node k. $\mathbf{S} = \mathbf{A}^{\wedge q}$ has the optimum
q-step path cost S(k, j) between node k and node j. If an optimum path of p + q
steps passes through some node k, the subpaths from node i to node k and from
node k to node j must be optimum subpaths. The
$\mathbf{R}\ \text{min.+}\ \mathbf{S}$ operation simply finds the one node k that
optimizes R(i, k) + S(k, j). By Bellman's principle, this must be the cost of the
least cost path of p + q steps and so it is also $\mathbf{A}^{\wedge (p+q)}$.

It is in that sense that we consider $\mathbf{A}^{\wedge n}$ to be isomorphic to
matrix exponentiation.

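This gives us other more efficient ways to compute $\mathbf{A}^{\wedge K}$. If we
choose K to be the smallest power of two greater than or equal to N − 1, we can
then economically compute the matrix whose i, j element is the absolute optimum
path cost from node i to node j

$$\begin{aligned}
\mathbf{B}^{(0)} &= \mathbf{A} \\
\mathbf{B}^{(1)} &= \mathbf{B}^{(0)}\ \text{min.+}\ \mathbf{B}^{(0)} = \mathbf{A}^{\wedge 2} \\
\mathbf{B}^{(2)} &= \mathbf{B}^{(1)}\ \text{min.+}\ \mathbf{B}^{(1)} = \mathbf{A}^{\wedge 4} \\
&\ \,\vdots \\
\mathbf{B}^{(r)} &= \mathbf{B}^{(r-1)}\ \text{min.+}\ \mathbf{B}^{(r-1)} = \mathbf{A}^{\wedge 2^r}
\end{aligned}$$

and stop when we have computed $\mathbf{A}^{\wedge 2^r}$ where $2^r \ge N - 1$.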
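The repeated-squaring scheme can be sketched in Python as follows. The cost matrix is a made-up four-node example (zeros on the diagonal, infinity for missing edges), and the function names are hypothetical.

```python
INF = float("inf")


def min_plus(a, b):
    """C(i,j) = min over k of a[i][k] + b[k][j], as in equation (3.1)."""
    return [[min(a[i][k] + b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def all_pairs_costs(a):
    """Square under min.+ until the step count 2^r reaches N - 1."""
    b, steps, squarings = a, 1, 0
    while steps < len(a) - 1:
        b = min_plus(b, b)        # B^(m) = B^(m-1) min.+ B^(m-1)
        steps *= 2
        squarings += 1
    return b, squarings


# Hypothetical cost matrix: 0 on the diagonal, INF where there is no edge.
A = [[0,   3,   INF, 7],
     [INF, 0,   1,   INF],
     [INF, INF, 0,   2],
     [1,   INF, INF, 0]]

B, R = all_pairs_costs(A)
```

With N = 4, two squarings are enough (2^2 = 4 ≥ 3), compared with the two min.+ products against A that the simple loop would also need here; the saving grows with N, since the loop needs N − 2 products while squaring needs only about log2(N − 1).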

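Again, we note that $\mathbf{A}^{\wedge K}$ gives the costs of the optimum paths,
not the paths themselves. If we want to recover the paths, we must save, in a set
of matrices $\mathbf{D}^{(m)}$, $m = 1, \ldots, r$, the values of k, which were
optimum in each min.+ computation. That is, we redefine the min.+ operation as
$[\mathbf{C}, \mathbf{D}] = \mathbf{A}\ \text{min.+}\ \mathbf{B}$ so that at
the same time we compute

$$\mathbf{C}(i,j) = \min_k \{\, \mathbf{A}(i,k) + \mathbf{B}(k,j) \,\}$$

$$\mathbf{D}(i,j) = \operatorname*{arg\,min}_k \{\, \mathbf{A}(i,k) + \mathbf{B}(k,j) \,\}$$

Now our algorithm is

$$\begin{aligned}
\mathbf{B}^{(0)} &= \mathbf{A} \\
[\mathbf{B}^{(1)}, \mathbf{D}^{(1)}] &= \mathbf{B}^{(0)}\ \text{min.+}\ \mathbf{B}^{(0)} \\
[\mathbf{B}^{(2)}, \mathbf{D}^{(2)}] &= \mathbf{B}^{(1)}\ \text{min.+}\ \mathbf{B}^{(1)}
\end{aligned}$$

If we want to recover the overall optimum path from node i to node j, we
look first at $\mathbf{D}^{(r)}(i,j)$ to find which node k is "halfway" from node i
to node j. Then in $\mathbf{D}^{(r-1)}$ we look at $\mathbf{D}^{(r-1)}(i,k)$ to find
the node halfway between node i and node k, and we also look at
$\mathbf{D}^{(r-1)}(k,j)$ to find the node halfway between node k and node j, and
so on.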
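The halving procedure for recovering a path can be written as a short recursion. This Python sketch assumes nonnegative costs with zeros on the diagonal, breaks ties by keeping the first minimizing k, and uses a made-up four-node cost matrix; all function names are hypothetical.

```python
INF = float("inf")


def min_plus_arg(a, b):
    """Return (C, D): C as in equation (3.1), D(i,j) = a k achieving the minimum."""
    n = len(a)
    c = [[INF] * n for _ in range(n)]
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                cost = a[i][k] + b[k][j]
                if cost < c[i][j]:            # ties keep the first k found
                    c[i][j], d[i][j] = cost, k
    return c, d


def squarings_with_args(a):
    """B^(0) = A, then [B^(m), D^(m)] = B^(m-1) min.+ B^(m-1) until 2^m >= N - 1."""
    b, ds, steps = a, [], 1
    while steps < len(a) - 1:
        b, d = min_plus_arg(b, b)
        ds.append(d)
        steps *= 2
    return b, ds


def recover_path(ds, i, j, m=None):
    """Nodes on a least cost path from i to j; call only when the cost is finite."""
    if m is None:
        m = len(ds)
    if i == j:
        return [i]
    if m == 0:
        return [i, j]                         # a single edge of the original graph
    k = ds[m - 1][i][j]                       # the "halfway" node at this level
    left = recover_path(ds, i, k, m - 1)
    right = recover_path(ds, k, j, m - 1)
    return left + right[1:]


# Hypothetical cost matrix: 0 on the diagonal, INF where there is no edge.
A = [[0,   3,   INF, 7],
     [INF, 0,   1,   INF],
     [INF, INF, 0,   2],
     [1,   INF, INF, 0]]

B, DS = squarings_with_args(A)
```

For example, `recover_path(DS, 0, 3)` walks through nodes 0, 1, 2, 3, whose edge costs 3 + 1 + 2 add up to `B[0][3]`.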



3.4 Summary 

If a graph is represented by its incidence matrix, some algorithms that find proper- 
ties of a graph may be expressible as operations on matrices. So, often a familiarity 
with matrix algebra will be useful in understanding some graph algorithms, or even 
in inventing some graph algorithms. In this chapter, we have given two very different
examples of graph algorithms based on familiarity with linear algebra.

In the first example, finding strongly connected components of a graph, we
reduced the graph problem to computing a sum of powers of the incidence matrix,
and then we recognized a matrix identity which helped us compute that sum of
powers in a simple closed form. It is worth noting that the Kth power of an incidence
matrix gives, in position i, j, the number of different paths that can possibly be
taken to traverse from node i to node j in exactly K steps.

In the second example, finding minimum path lengths, the usual definitions of
matrix multiplication and addition were not useful, but the operation that was useful
was isomorphic to standard matrix multiplication, using the min operation instead
of addition and the + operation in place of multiplication, with the basic steps
ordered just as in matrix multiplication. We then showed that the computation of
minimum path lengths was isomorphic to computation of a power of the incidence
matrix, and then we were able to see that repeated squaring, the shortcut method
of computing a power of a matrix, is isomorphic to a similar shortcut for finding all
shortest paths.
Chapter 4

Some Graph Algorithms in
an Array-Based Language



Viral B. Shah*, John Gilbert†, and Steve Reinhardt‡



Abstract 

This chapter describes some of the foundations of linear algebraic graph 
algorithms and presents a number of classic graph algorithms using 
Matlab style syntax. These algorithms are implicitly parallel, pro- 
vided the underlying parallel matrix operations are supported in the 
array-based language. 

4.1 Motivation 

High performance applications increasingly combine numerical and combinatorial 
algorithms. Past research on high performance computation has focused mainly on 
numerical algorithms, and we have a rich variety of tools for high performance nu- 
merical computing. On the other hand, few tools exist for large-scale combinatorial 
computing. Our goal is to allow scientists and engineers to develop applications 
using both numerical and combinatorial methods with as little effort as possible. 

Sparse matrix computations allow structured representation of irregular data 
structures, decompositions, and irregular access patterns in parallel applications. 
Sparse matrices are a convenient way to represent graphs. Since sparse matrices are
first-class citizens in Matlab and many of its parallel dialects [Choy & Edelman 2005],
it is natural to use the duality between sparse matrices and graphs to develop a
unified infrastructure for numerical and combinatorial computing.

Several researchers are building libraries of parallel graph algorithms: the
Parallel Boost Graph Library (PBGL) at Indiana University [Gregor & Lumsdaine 2005],
the SNAP library at the Georgia Institute of Technology [Bader 2005], and the
Multi-threaded Graph Library (MTGL) at Sandia National Laboratories (see
[Fleischer et al. 2000]). PBGL uses MPI for parallelism and builds upon the Boost
graph library [Siek et al. 2002]. Both SNAP and MTGL focus on thread-level
parallelism.

Our approach relies upon representing graphs with sparse matrices. The
efficiency of our graph algorithms thus depends upon the efficiency of the underlying
sparse matrix implementation. We use the distributed sparse array infrastructure
in Star-P (see [Shah & Gilbert 2005]).

Parallelism arises from parallel operations on sparse matrices. This yields 
several advantages. Graph algorithms are written in a high-level language (here, 
Matlab), making codes short, simple, and readable. The data-parallel matrix code 
has a single thread of control, which simplifies writing and debugging programs. The 
distributed sparse array implementation in Star-P provides a set of well-tested 
primitives. 

The primitives described in the next section are used to implement several
graph algorithms in our "Knowledge Discovery Toolbox" (KDT, formerly GAPDT);
see [Gilbert et al. 2007a, Gilbert et al. 2007b]. High performance and interactivity
are salient features of this toolbox. KDT was designed from the outset to run
interactively with terascale graphs via Star-P. Much of KDT scales to tens or
hundreds of processors.

4.2 Sparse matrices and graphs 

Every sparse matrix problem is a graph problem, and every graph problem is a 
sparse matrix problem. 

A graph consists of a set of nodes V, connected by a set E of directed or
undirected edges. A graph can be specified by triples: (u, v, w) represents a directed
edge of weight w from node u to node v; the edge is a loop if u = v. This
corresponds to a nonzero w at location (u, v) in a sparse matrix. An undirected graph
corresponds to a symmetric matrix. Table 4.1 lists some corresponding matrix and
graph operations.

The sparse matrix implementations of both Matlab and Star-P (see 
[Gilbert et al. 1992, Shah & Gilbert 2005]) attempt to follow two rules for com- 
plexity. 

• Storage for a sparse matrix should be proportional to the number of rows, 
columns, and nonzero entries. 

• A sparse matrix operation should (as nearly as possible) take time propor- 
tional to the size of the data accessed and the number of nonzero arithmetic 
operations. 
Table 4.1. Matrix/graph operations.

Many simple sparse matrix operations can be used directly to perform
basic operations on graphs.

Matrix operation               Graph operation
-----------------------------  ------------------------------------
G = sparse(U, V, W)            Construct a graph from an edge list
[U, V, W] = find(G)            Obtain the edge list from a graph
vtxdeg = sum(spones(G))        Node degrees for an undirected graph
indeg = sum(spones(G))         Indegrees for a directed graph
outdeg = sum(spones(G), 2)     Outdegrees for a directed graph
N = G(i, :)                    Find all neighbors of node i
Gsub = G(subset, subset)       Extract a subgraph of G
G(i, j) = w                    Add or relabel a graph edge
G(i, j) = 0                    Delete a graph edge
G(I, I) = []                   Delete graph nodes
G = G(perm, perm)              Permute nodes of a graph
reach = G * start              Breadth-first search step



Translating to graph language, we see that the storage for a graph is $O(|V| + |E|)$,
and elementary operations (ideally) take time linear in the size of the data accessed.

4.2.1 Sparse matrix multiplication 

Sparse matrix multiplication can be a basic building block for graph computations.
Path problems on graphs, for example, have been studied in terms of matrix
operations; see [Aho et al. 1974, Tarjan 1981]. Specifically, sparse matrix
multiplication over semirings can be used to implement a wide variety of graph
algorithms.

A semiring is an algebraic structure $(S, \oplus, \otimes, 0, 1)$, where S is a set
of elements with binary operations $\oplus$ ("addition") and $\otimes$
("multiplication"), and distinguished elements 0 and 1, that satisfies the
following properties

1. $(S, \oplus, 0)$ is a commutative monoid with identity 0

• associative: $a \oplus (b \oplus c) = (a \oplus b) \oplus c$

• commutative: $a \oplus b = b \oplus a$

• identity: $a \oplus 0 = 0 \oplus a = a$

2. $(S, \otimes, 1)$ is a monoid with identity 1

• associative: $a \otimes (b \otimes c) = (a \otimes b) \otimes c$

• identity: $a \otimes 1 = 1 \otimes a = a$

3. $\otimes$ distributes over $\oplus$

• $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$

• $(b \oplus c) \otimes a = (b \otimes a) \oplus (c \otimes a)$

4. 0 is an annihilator under $\otimes$

• $a \otimes 0 = 0 \otimes a = 0$
Sparse matrix multiplication has the same control flow and data dependencies
over any semiring, so it is straightforward to extend any matrix multiplication code
to an arbitrary semiring. We have extended Star-P's sparse matrix arithmetic to
operate over semirings.

Here are a few of the semirings useful for graph algorithms

• $(\mathbb{R}, +, \times, 0, 1)$ is the usual real field, which is a semiring.

• $(\{0, 1\}, |, \&, 0, 1)$ is the boolean semiring. It is useful in graph traversal
algorithms such as breadth-first search.

• $(\mathbb{R} \cup \{\infty\}, \min, +, \infty, 0)$, sometimes called the tropical
semiring, can be used to implement various shortest path algorithms.

• $(\mathbb{R} \cup \{\infty\}, \min, \times, \infty, 1)$ can be used for operations
like selecting a subgraph or contracting nodes to form a quotient graph. Typically
the graph's adjacency matrix is multiplied by an indexing matrix containing only
zeros and ones. The min operator decides the weight of collapsed edges; other
operators can be substituted. This semiring can also be used to implement Cohen's
algorithm [Cohen 1998] to estimate fill in sparse matrix multiplication.

• Semirings over tuples can be used to compute actual shortest paths (rather
than just their lengths); see, for example, Fineman and Robinson's chapter
(Chapter 5) in this book.
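To illustrate that the control flow of matrix multiplication does not depend on the semiring, here is a small Python sketch in which the two operations and the additive identity are passed in as parameters. The function name, the dictionaries, and the three-node example matrices are assumptions of this sketch, and dense lists are used only for brevity.

```python
from functools import reduce

INF = float("inf")


def semiring_matmul(a, b, add, mul, zero):
    """Matrix product in which add, mul, and zero replace +, x, and 0."""
    return [[reduce(add, (mul(a[i][k], b[k][j]) for k in range(len(b))), zero)
             for j in range(len(b[0]))] for i in range(len(a))]


REAL = dict(add=lambda x, y: x + y, mul=lambda x, y: x * y, zero=0)
BOOLEAN = dict(add=lambda x, y: x | y, mul=lambda x, y: x & y, zero=0)
TROPICAL = dict(add=min, mul=lambda x, y: x + y, zero=INF)

# A three-node path graph 0 -> 1 -> 2: its 0/1 pattern and its edge costs.
G = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
W = [[0,   2,   INF],
     [INF, 0,   5],
     [INF, INF, 0]]

two_step_counts = semiring_matmul(G, G, **REAL)        # number of 2-step paths
two_step_reach = semiring_matmul(G, G, **BOOLEAN)      # existence of 2-step paths
two_step_costs = semiring_matmul(W, W, **TROPICAL)     # least cost in <= 2 steps
```

The same loop nest produces path counts, reachability, or least costs depending only on which semiring is supplied.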

4.3 Graph algorithms 

In this section, we describe algebraic implementations of several computations on 
graphs that arise in combinatorial scientific computing. 

4.3.1 Breadth-first search 

A breadth-first search can be performed by multiplying a sparse matrix G with a
sparse vector x. To search from node i, we begin with x(i) = 1 and x(j) = 0 for
j ≠ i. Then $y = G^T * x$ picks out row i of G, which contains the neighbors of node
i. Multiplying y by $G^T$ gives nodes two steps away, and so on. Figure 4.1 shows
an example. (The example uses the matrix A = G + I in place of G, which has the
effect of selecting all nodes at distance at most k on the kth step.)

We can perform several independent breadth-first searches simultaneously by
using sparse matrix-matrix multiplication. Instead of the vector x, we use a
matrix X with a column for each starting node. After $Y = G^T * X$, column j of Y
contains the result of a breadth-first search step from the node (or nodes) specified
by column j of X. Using an efficient sparse matrix data structure, the time
complexity of the search is the same as it would be with a traditional sparse graph
data structure.
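The following Python sketch mimics one such product over the boolean semiring, with the sparse matrix held as a dictionary of row sets and the sparse vector held as a set of nonzero positions. That representation, the function names, and the five-node graph are assumptions made for illustration.

```python
def bfs_step(out_edges, frontier):
    """One product y = G^T * x over the boolean semiring.

    out_edges[i] is the set of columns j with G(i, j) nonzero, and frontier is
    the set of positions where x is nonzero.
    """
    reached = set()
    for i in frontier:
        reached |= out_edges.get(i, set())
    return reached


def bfs_levels(out_edges, start):
    """Distance from start to every reachable node, one product per level."""
    level, frontier, depth = {start: 0}, {start}, 0
    while frontier:
        depth += 1
        frontier = {v for v in bfs_step(out_edges, frontier) if v not in level}
        for v in frontier:
            level[v] = depth
    return level


# Hypothetical directed graph on nodes 1..5.
G = {1: {2, 4}, 2: {3}, 3: set(), 4: {3, 5}, 5: {1}}
LEVELS = bfs_levels(G, 1)
```

Each step touches only the rows indexed by the current frontier, so the total work is proportional to the number of edges examined, as with a conventional adjacency-list search.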
Figure 4.1. Breadth-first search by matrix vector multiplication.

A sparse vector is initialized with a 1 in the position of the start node.
Repeated multiplication yields multiple breadth-first steps on the graph.
The matrix can be either symmetric or asymmetric.



4.3.2 Strongly connected components 

A strongly connected component of a directed graph is a maximal set of nodes
all mutually reachable by directed paths. Every node belongs to exactly one such
component. Tarjan's seminal paper [Tarjan 1972] uses depth-first search to find
strongly connected components in linear time, but the depth-first search does not
parallelize well. Instead, we implement a recursive divide-and-conquer algorithm
due to Fleischer, Hendrickson, and Pinar [Fleischer et al. 2000] that is efficient in
practice on many realistic graphs, although not in the worst case.

Node v is reachable from node u if there is a path of directed edges from u
to v. The descendants of v are those nodes reachable from v. The predecessors of
v are those nodes from which v is reachable. Descendants and predecessors can be
found by breadth-first search, as in Algorithm 4.1.

Algorithm 4.1. Predecessors and descendants.

Predecessors of v are nodes from which v is reachable and are found by breadth-first
search in G. Descendants are found by breadth-first search in $G^T$.

function x = predecessor(G, v)
% Predecessors of a node in a graph

x = sparse(length(G), 1);
xold = x;
x(v) = 1;                   % Start BFS from v.

while any(x ~= xold)        % Repeat until no new nodes are reached.
    xold = x;
    x = x | G * x;          % Add every node with an edge into the current set.
end
The Fleischer-Hendrickson-Pinar algorithm chooses a "pivot" node at random
and computes its descendants and predecessors. The set of nodes that are
both descendants and predecessors of the pivot (including the pivot itself) forms
one strongly connected component. It is easy to show that every remaining
component consists either entirely of descendants, entirely of predecessors, or
entirely of nodes that are neither descendants nor predecessors. Thus, the algorithm
proceeds recursively on these three disjoint subsets of nodes. Algorithm 4.2 gives
the code.

Algorithm 4.2. Strongly connected components.

function scomponents(G, map)
% Strongly connected components of a graph

global label count;

if nargin == 1; map = 1:length(G); end;    % start recursion
if isempty(G); return; end;                % end recursion

v = 1 + fix(rand * length(G));             % random pivot
pred = predecessor(G, v);
desc = predecessor(G', v);

% Intersection of predecessors+descendants is a component.
scc = pred & desc;
count = count + 1;
label(map(scc)) = count;

% Recurse on subgraph of predecessors
remain = xor(pred, scc);
scomponents(G(remain, remain), map(remain));

% Recurse on subgraph of descendants
remain = xor(desc, scc);
scomponents(G(remain, remain), map(remain));

% Recurse on subgraph of remaining nodes
remain = ~(pred | desc);
scomponents(G(remain, remain), map(remain));
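For readers who want to experiment outside an array-based language, the same divide-and-conquer idea can be sketched in Python. This is not the Star-P implementation: the graph is a dictionary of successor sets, an explicit work list replaces the recursion, and the edge list is an assumed eight-node example (nodes a through h).

```python
import random


def reach(edges, nodes, start):
    """Nodes of `nodes` reachable from start along `edges` (breadth-first)."""
    seen, frontier = {start}, [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in edges.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    nxt.append(v)
        frontier = nxt
    return seen


def scomponents(out_edges, rng=None):
    """Strongly connected components by the pivot-based divide and conquer."""
    rng = rng or random.Random(0)
    in_edges = {}
    for u, vs in out_edges.items():
        for v in vs:
            in_edges.setdefault(v, set()).add(u)
    components = []
    work = [set(out_edges) | set(in_edges)]      # subsets still to be split
    while work:
        part = work.pop()
        if not part:
            continue
        pivot = rng.choice(sorted(part))         # random pivot
        desc = reach(out_edges, part, pivot)     # descendants within this subset
        pred = reach(in_edges, part, pivot)      # predecessors within this subset
        scc = desc & pred
        components.append(scc)
        work.extend([pred - scc, desc - scc, part - (pred | desc)])
    return components


# Assumed edge list for an eight-node example ("ab" means a -> b).
EDGES = ["ab", "bc", "be", "bf", "cd", "cg", "dc", "dh",
         "ea", "ef", "fg", "gf", "gh", "hh"]
GRAPH = {}
for u, v in EDGES:
    GRAPH.setdefault(u, set()).add(v)
COMPONENTS = scomponents(GRAPH)
```

The order in which components are discovered depends on the random pivots, but the set of components does not.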

4.3.3 Connected components 

A connected component in an undirected graph is a maximal connected subgraph. 
Every node belongs to exactly one connected component. 

We implement the Awerbuch-Shiloach algorithm [Awerbuch & Shiloach 1987]
to find connected components of a graph in parallel. This algorithm builds up rooted
trees of nodes, beginning with each node in its own tree and ending with one tree
for each component. The final trees are stars, which means that every tree node
except the root is a child of the root. The set of trees is represented by a vector
D that gives the parent of each node. If i is a root, D(i) = i. At the end, the
root serves as a label for the component; each component consists of nodes with a
common D(i).

The method, shown in Algorithm 4.3, iterates a sequence of three steps until
the trees stop changing. The first two steps, "conditional hooking" and
"unconditional hooking," combine pairs of trees into single trees. The third step,
"pointer jumping," collapses each tree into a star. Roughly speaking, the conditional
hooking step connects a star with a higher-numbered root to a lower-numbered node
of an adjacent tree, and the unconditional hooking step forces every star that is
not a complete component to connect to something. Awerbuch and Shiloach show
that the algorithm terminates in $O(\log |V|)$ iterations. The total work is
therefore $O(|E| \log |V|)$.
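The parent-vector representation and the pointer jumping step can be illustrated in isolation. The Python sketch below assumes 0-based node numbers and hypothetical function names; it covers only the third step, not the two hooking steps.

```python
def pointer_jumping(d):
    """Replace each parent by its grandparent until every tree is a star."""
    d = list(d)
    while True:
        nxt = [d[d[i]] for i in range(len(d))]   # all nodes jump simultaneously
        if nxt == d:
            return d
        d = nxt


def is_star_forest(d):
    """True when every node's parent is a root, i.e., D(D(i)) = D(i) for all i."""
    return all(d[d[i]] == d[i] for i in range(len(d)))


# Two chains: 3 -> 2 -> 1 -> 0 (root 0) and 6 -> 5 -> 4 (root 4).
D = [0, 0, 1, 2, 4, 4, 5]
STARS = pointer_jumping(D)
```

Because each pass halves the depth of every tree, a tree of depth h becomes a star after about log2(h) passes; here the result labels nodes 0 to 3 with root 0 and nodes 4 to 6 with root 4.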



4.3.4 Maximal independent set 

An independent set in an undirected graph is a set of nodes, no two of which 
are adjacent. An independent set is maximal if it is not a subset of any other 
independent set. 

We use Luby's randomized algorithm [Luby 1985] to compute a maximal in- 
dependent set (MIS), as shown in Algorithm 4.4. We begin by selecting nodes in 
the graph with probability inversely proportional to their degrees. If we select both 
endpoints of an edge, we deselect the lower-degree one. We add the remaining se- 
lected nodes to the independent set. We then iterate on the subgraph that remains 
after removing the selected nodes and their neighbors. 

Luby shows that, with high probability, each iteration eliminates at least 1/8
of the edges, and therefore the number of iterations is almost certainly O(log |V|).
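The selection, deselection, and pruning steps described above can be sketched sequentially in Python. This is an illustrative sketch rather than the toolbox code: the adjacency-dictionary input, the function name, the rule that isolated nodes are always selected, and the tie-break by node id when two selected neighbors have equal degree are all assumptions.

```python
import random

def luby_mis(adj, rng=None):
    """adj: dict node -> set of neighbours (undirected, no self-loops)."""
    rng = rng or random.Random()
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    independent = set()
    while adj:
        # Select each node with probability 1/(2*degree); isolated nodes always.
        selected = {u for u, vs in adj.items()
                    if not vs or rng.random() <= 1.0 / (2 * len(vs))}
        # For every edge with both endpoints selected, drop the lower-degree
        # endpoint (ties broken by node id, an assumption of this sketch).
        losers = set()
        for u in selected:
            for v in adj[u]:
                if v in selected and (len(adj[u]), u) < (len(adj[v]), v):
                    losers.add(u)
        selected -= losers
        independent |= selected
        # Remove the selected nodes and their neighbours, then iterate.
        removed = set(selected)
        for u in selected:
            removed |= adj[u]
        adj = {u: vs - removed for u, vs in adj.items() if u not in removed}
    return independent
```

The result is random, but any run returns a set in which no two nodes are adjacent and every other node has a neighbor in the set.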



4.3.5 Graph contraction 

A contraction of a graph, also called a quotient graph, is obtained by merging subsets 
of nodes into single nodes. Contraction is a common operation in recursive compu- 
tations on graphs, appearing, for example, in some graph partitioning algorithms 
and in numerical multigrid solvers. 

As Algorithm 4.5 shows, graph contraction can be implemented in parallel by 
sparse matrix matrix multiplication. The input is a graph and an integer label for 
each node; the quotient graph merges nodes with the same label. The key is to 
form the sparse matrix S, which has a column for every node of the input graph 
and a row for every node of the quotient graph. 
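The effect of the product S * G * S' can be sketched without a sparse matrix library. In this minimal Python sketch (the dictionary-based edge representation and the function name are assumptions), each edge (u, v) contributes its weight to entry (labels[u], labels[v]) of the quotient graph, which is what the triple product computes entry by entry over the usual (+, *) semiring.

```python
from collections import defaultdict

def contract(edges, labels):
    """edges: dict (u, v) -> weight; labels: dict node -> quotient-node label."""
    quotient = defaultdict(float)
    for (u, v), w in edges.items():
        # Entry (labels[u], labels[v]) of S*G*S' accumulates every merged edge.
        quotient[(labels[u], labels[v])] += w
    return dict(quotient)
```

Edges whose endpoints share a label end up on the diagonal of the quotient matrix, just as they do in the matrix product.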

Edges in the input graph merge when their endpoint nodes are contracted
together. The code here sums the weights on the merged edges, but by using a
different semiring, we could specify other ways to combine weights.

Algorithm 4.3. Connected components.

    % Label nodes by connected components
    function D = components(G)
      D = 1:length(G); [u v] = find(G);
      while true
        D = conditional_hooking(D, u, v);
        star = find_stars(D);
        if nnz(star) == length(G); break; end;
        D = unconditional_hooking(D, star, u, v);
        D = pointer_jumping(D);
      end;
    end % components()

    % Hook higher roots to lower neighbors
    function D = conditional_hooking(D, u, v)
      Du = D(u); Dv = D(v);
      hook = Du == D(Du) & Dv < Du;
      Du = Du(hook); Dv = Dv(hook);
      D(Du) = Dv;
    end % conditional_hooking()

    % Hook adjacent stars together
    function D = unconditional_hooking(D, star, u, v)
      Du = D(u); Dv = D(v);
      hook = star(u) & Dv ~= Du;
      Du = Du(hook); Dv = Dv(hook);
      D(Du) = Dv;
    end % unconditional_hooking()

    % Determine which nodes are in stars
    function star = find_stars(D)
      star = D == D(D);
      star(D) = star(D) & star;
      star(D) = star(D) & star;
    end % find_stars()

    % Shortcut all paths to point directly to roots
    function D = pointer_jumping(D)
      Dold = zeros(1, length(D));
      while any(Dold ~= D)
        Dold = D;
        D = D(D);
      end;
    end % pointer_jumping()
Algorithm 4.4. Maximal independent set.

Luby's algorithm randomly selects low degree nodes to add to the MIS. Neighbors
of selected nodes are then ignored while the computation proceeds on the remaining
subgraph. Part of the code is omitted for brevity.

    function IS = mis(G)
    % Maximal independent set of a graph

      IS = [];
      while length(G) > 0
        % Select vertices with probability 1/(2*degree)
        degree = sum(G,2);
        prob = 1 ./ (2 * degree);
        select = rand(length(G), 1) <= prob;

        % Deselect one of each pair of selected neighbors
        neighbors = select & G * select;
        deselects = ...; % lower degree neighbors
        if ~isempty(neighbors); select(deselects) = 0; end

        % Add selected nodes to independent set
        IS = [IS find(select)];

        % Exclude neighbors of selected vertices
        remain = not(select | G * select);

        % Iterate on the remaining subgraph
        G = G(remain, remain);
      end

Algorithm 4.5. Graph contraction.

    function C = contract(G, labels)
    % Contract nodes with the same label

      n = length(G);
      m = max(labels);
      S = sparse(labels, 1:n, 1, m, n);
      C = S * G * S';
    end

4.3.6 Graph partitioning 

It is often useful to divide a graph into two (or more) pieces by removing a small 
number of edges or vertices. Such partitions, usually continued recursively to pieces
and subpieces and so on, have been applied to load balancing for parallel com-
putation, VLSI circuit layout, direct solvers for linear systems, and many divide- 
and-conquer graph algorithms. Finding the best partition for an arbitrary graph 
is intractable; many heuristic approaches have been developed. Our KDT toolbox 
includes three partitioning heuristics based on Matlab codes from the Meshpart 
toolbox [Gilbert & Teng 2002]. In each case, an efficient parallel implementation 
follows directly from the parallel array infrastructure of Star-P. 

KDT implements the geometric sphere partitioning algorithm of Miller et al.
(see [Miller et al. 1998]) for graphs whose nodes have coordinates in low-dimensional
Euclidean space. This algorithm guarantees the quality of partitions for meshes
whose elements are "well shaped" in a certain sense that includes most meshes
used for numerical finite element methods in 2D and 3D. The algorithm parallelizes
naturally; indeed, the sequential Matlab code from Meshpart [Gilbert et al. 1998]
works efficiently in Star-P with one minor modification to serialize computation
on small matrices.

KDT also implements a spectral partitioning algorithm for arbitrary graphs.
Spectral partitioning methods have arisen independently in several different fields.
The theory is based on ideas of Fiedler [Fiedler 1975]; Pothen, Simon, and Liou
suggested applying spectral partitions to parallel computation [Pothen et al. 1990].
Spectral partitioning begins with a graph's Laplacian matrix. The Laplacian $L$
has off-diagonal elements $L_{ij} = -1$ if $(i,j)$ is an edge of the graph, and $L_{ij} = 0$
otherwise. The diagonal element $L_{ii}$ is the degree of node $i$; thus the rows of $L$ sum
to zero. The Laplacian is symmetric and positive semidefinite, so it has nonnegative
real eigenvalues. The multiplicity of zero as an eigenvalue is equal to the number
of connected components of the graph; if the graph is connected, zero is a simple
eigenvalue. The smallest nonzero eigenvalue of a connected graph's Laplacian is
called its Fiedler value, and the corresponding eigenvector is called a Fiedler vector.

The simplest spectral partitioning method labels node i of the graph with the
ith component of its Fiedler vector, and then partitions the nodes into equal-sized
subsets around the median label. This choice can be heuristically justified as rep-
resenting the continuous relaxation of the NP-complete discrete problem of finding
the minimum-sized edge partition of the graph. This method, which is implemented
in KDT's specpart() routine, tends to find good partitions in practice. A slightly
more complicated recursive Fiedler vector method guarantees good partitions for
planar graphs and well-shaped finite element meshes [Spielman & Teng 2007]. In
Star-P, KDT uses the built-in eigensolver to obtain the Fiedler vector. This par-
allelizes well, but it is nonetheless an expensive computation.
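The median split around a Fiedler vector can be sketched in pure Python for small connected graphs. Note the substitution: KDT relies on a built-in eigensolver, whereas this sketch uses a plain shifted power iteration, chosen only because it needs no libraries. The function name, the edge-list input, the shift, and the iteration count are assumptions for illustration.

```python
def fiedler_partition(n, edges, iters=500):
    """Split nodes 0..n-1 of a connected graph around the median Fiedler value."""
    # Laplacian: L[i][j] = -1 on edges, L[i][i] = degree of node i.
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] = L[j][i] = -1.0
    for i in range(n):
        L[i][i] = -sum(L[i])
    # Every eigenvalue of L is at most 2 * (max degree), so shift*I - L is
    # positive semidefinite and its dominant eigenvectors are L's smallest ones.
    shift = 2.0 * max(L[i][i] for i in range(n))
    x = [i - (n - 1) / 2.0 for i in range(n)]   # starts orthogonal to all-ones
    for _ in range(iters):
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]               # project out the constant vector
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    order = sorted(range(n), key=lambda i: x[i])
    return set(order[: n // 2]), set(order[n // 2:])
```

On a path graph the Fiedler vector is monotone along the path, so the split cuts the single middle edge. Power iteration converges slowly when the second and third Laplacian eigenvalues are close, which is one reason a production code would use a proper eigensolver.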

KDT also includes the geometric spectral partitioner from the Meshpart
toolbox. This algorithm first computes a small number k of eigenvectors of the
graph Laplacian and uses them as node coordinates in k-dimensional Euclidean
space. It then uses the geometric sphere separator algorithm to find a small
cut for the graph as embedded. This method was suggested by Chan, Gilbert,
and Teng [Chan et al. 1994], who showed that it worked well on several example
problems.

4.4 Graph generators

Our KDT toolbox includes routines to generate three types of graphs that are useful 
as test cases for algorithms in various domains. The generators are fast and scalable; 
we use them to generate graphs ranging from a few nodes to billions of nodes. 

4.4.1 Uniform random graphs 

Our uniform random graph generator produces either directed or undirected graphs
in which a specified number or density of edges are chosen uniformly at random,
with replacement. This uses the logic of Matlab's sprand() routine, which generates
a set of edges and builds the sparse adjacency matrix efficiently via sparse(); in
Star-P, this in turn uses an efficient parallel sorting routine [Cheng et al. 2006].
For low densities, the resulting graphs have distribution similar to the Erdős-Rényi
random graph model [Erdős & Rényi 1959], in which each potential edge in the
graph is created independently with fixed probability.

4.4.2 Power law graphs 

For graphs with highly variable node degrees, we include a generator due to Kepner
[Bader et al. 2004] for Leskovec et al.'s R-MAT graphs [Leskovec et al. 2005]. With
appropriate parameters, R-MAT produces graphs with power law degree distribu-
tions similar to those occurring in many applications. For an R-MAT graph with
$2^k$ nodes, the Matlab code uses k iterations of a simple vectorized computation
to generate edges, and then calls sparse() to create the graph. The code runs
efficiently in Star-P without modification.
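The k-iteration structure of an R-MAT generator can be sketched as follows. This is an illustrative Python sketch, not the Kepner Matlab code: the function name, the quadrant probabilities (0.57, 0.19, 0.19, 0.05), and the seeding are assumptions. Each of the k passes chooses one quadrant per edge and so fixes one more bit of every edge's row and column index.

```python
import random

def rmat_edges(k, num_edges, probs=(0.57, 0.19, 0.19, 0.05), seed=0):
    """Return num_edges (row, col) pairs for a graph on 2**k nodes."""
    rng = random.Random(seed)
    a, b, c, _ = probs            # top-left, top-right, bottom-left, bottom-right
    rows = [0] * num_edges
    cols = [0] * num_edges
    for _ in range(k):            # one pass per bit of the node index
        for e in range(num_edges):
            r = rng.random()
            lower_half = r >= a + b                        # bottom quadrants
            right_half = (a <= r < a + b) or (r >= a + b + c)
            rows[e] = 2 * rows[e] + int(lower_half)
            cols[e] = 2 * cols[e] + int(right_half)
    return list(zip(rows, cols))
```

The inner loop over edges is the part a vectorized implementation would express as whole-array operations. Duplicate edges are possible and would be combined when the sparse matrix is assembled.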

Figure 4.2 shows a density spy plot of an R-MAT graph. The recursive struc- 
ture is visible in the nonrandomized plot. Randomly relabeling the nodes destroys 
locality and structure. Figure 4.3 shows the degree distribution of an R-MAT graph. 
The performance plot (Figure 4.4) shows that the parallel R-MAT generator scales 
well, all the way to a billion nodes. These experiments were performed using 240 
processors of a 256-processor shared memory computer. 

4.4.3 Regular geometric grids 

The KDT toolbox also includes generators for two- and three-dimensional regular
grids, which arise in many settings in physical modeling. The routines we provide are
the same as those in the Meshpart toolbox [Gilbert & Teng 2002], which generate
5-point, 7-point, and 9-point finite difference meshes on the unit square; equilateral
triangular meshes; and cubic and tetrahedral meshes in three dimensions. The mesh
generators also return the coordinates of the nodes in Euclidean space.

The original mesh generators use a routine called blockdiags() to build
sparse matrices with specified diagonal and block diagonal structure. Our Star-P
version of blockdiags() uses Kronecker products (via the kron() routine) to gen-
erate block diagonal matrices. The parallel code for sparse kron() is identical to
the sequential code.
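How Kronecker products produce block structure can be seen on a small 5-point grid. The sketch below uses dense Python lists and is only meant to illustrate the idea; the function names and the dense representation are assumptions, and it is not the blockdiags() code. With I the identity and T the adjacency matrix of a path, kron(I, T) is block diagonal and holds the edges within each grid row, while kron(T, I) holds the edges between adjacent rows.

```python
def kron(A, B):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def grid5(n):
    """Adjacency matrix of the n-by-n 5-point grid; node (r, c) has index r*n + c."""
    I = [[int(i == j) for j in range(n)] for i in range(n)]
    T = [[int(abs(i - j) == 1) for j in range(n)] for i in range(n)]  # path graph
    horiz = kron(I, T)   # block diagonal: edges within each grid row
    vert = kron(T, I)    # edges between adjacent grid rows
    size = n * n
    return [[horiz[i][j] + vert[i][j] for j in range(size)] for i in range(size)]
```

For n = 3 the center node has four neighbors and the grid has 12 undirected edges, so the entries of the adjacency matrix sum to 24.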







Figure 4.2. Adjacency matrix density of an R-MAT graph.

The left image is an R-MAT graph with 1024 nodes and 6671 edges;
note the recursive structure. The right image shows the same graph
with nodes relabeled randomly.

Figure 4.3. Vertex degree distribution in an R-MAT graph.

Figure 4.4. Performance of parallel R-MAT generator.
(Plot title: Performance of R-MAT with 240 processors.)

Chapter 5

Fundamental Graph
Algorithms



Jeremy T. Fineman* and Eric Robinson



Abstract 

This chapter discusses the representation of several fundamental graph 
algorithms as algebraic operations. Even though the underlying algo- 
rithms already exist, the algebraic representation allows for easily ex- 
pressible efficient algorithms with appropriate matrix constructs. This 
chapter gives algorithms for single-source shortest paths, all-pairs short- 
est paths, and minimum spanning tree. 

5.1 Shortest paths 

In shortest path problems, we are given a weighted, directed graph $G = (V, E)$
with edge weights $w : E \to \mathbb{R}_\infty$, where $\mathbb{R}_\infty = \mathbb{R} \cup \{\infty\}$, and a weight matrix
$W : \mathbb{R}_\infty^{N \times N}$, where $W(u,v) = \infty$ if $(u,v) \notin E$. For simplicity, $W(v,v) = 0$ for
all $v \in V$. A path $p$ from $v_0$ to $v_k$, denoted $v_0 \leadsto v_k$, is a sequence of vertices
$p = (v_0, v_1, \ldots, v_k)$ such that $(v_{i-1}, v_i) \in E$. We define the weight of the path to
be $w(p) = \sum_{i=1}^{k} W(v_{i-1}, v_i)$. We say that the size of the path $p$ is $k$ hops, denoted
by $|p| = k$.

The shortest path distance (or shortest path weight) from $u$ to $v$ is given by

\[
\Delta(u,v) =
\begin{cases}
\min\{w(p) : p \text{ is a path } u \leadsto v\} & \text{if a path from } u \text{ to } v \text{ exists} \\
\infty & \text{otherwise}
\end{cases}
\]

A shortest path from $u$ to $v$ is a path $p : u \leadsto v$ with $w(p) = \Delta(u, v)$.



*School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
(jfineman@cs.cmu.edu).

This section considers the Bellman-Ford algorithm for single-source shortest
paths and the Floyd-Warshall algorithm for all-pairs shortest paths. The algebraic
variants given here of both these algorithms achieve the same running times as the
standard graph formulations, e.g., as given in [Cormen et al. 2001]. This section does
not include Dijkstra's algorithm (which restricts the input to graphs with nonnegative
edge weights) since Section 5.2 gives the remarkably similar Prim's algorithm.



5.1.1 Bellman-Ford

The Bellman-Ford algorithm [Bellman 1958, Ford & Fulkerson 1962] solves the
single-source shortest paths problem. Given a graph $G = (V, E)$ with edge weights
$w$ and a designated source $s \in V$, this algorithm determines whether there is
a negative-weight cycle reachable from $s$. If there is no negative-weight cycle,
Bellman-Ford returns the shortest path distances $\Delta(s,v)$ for all $v \in V$ and the
corresponding paths.

One standard presentation of the Bellman-Ford algorithm is given by Algo-
rithm 5.1. Bellman-Ford stores for each vertex $v$ an estimate $d(v)$ on the shortest
path distance, maintaining that $d(v) \geq \Delta(s, v)$. The algorithm performs a sequence
of "edge relaxations," after which $d(v) = \Delta(s,v)$. To relax the edge $(u,v)$ simply
means that $d(v) = \min\{d(v), d(u) + W(u,v)\}$. In particular, Bellman-Ford consists
of $N$ iterations, relaxing all edges in each iteration (in arbitrary order). Lines 8-10
simply give the implementation of an edge relaxation. We use $\pi(v)$ to store the
parent of $v$ in the shortest path tree. Upon completion of the algorithm, $\pi$ can be
used to find the shortest paths and not just the distances.
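The relaxation loop and the final negative-cycle check can be sketched in Python. This is a minimal sketch of the standard algorithm; the edge-list input, the 0-based vertex numbering, and the use of an exception to report a negative-weight cycle are assumptions made for illustration.

```python
import math

def bellman_ford(n, edges, s):
    """edges: list of (u, v, weight) triples; vertices are 0..n-1."""
    d = [math.inf] * n
    parent = [None] * n
    d[s] = 0
    for _ in range(n - 1):
        for u, v, w in edges:
            if d[u] + w < d[v]:          # relax edge (u, v)
                d[v] = d[u] + w
                parent[v] = u
    for u, v, w in edges:                # one more pass detects negative cycles
        if d[u] + w < d[v]:
            raise ValueError("A negative-weight cycle exists.")
    return d, parent
```

Following parent[v] back from any reachable vertex v recovers a shortest path to it in reverse.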

Algorithm 5.1. Bellman-Ford.

A standard implementation of the Bellman-Ford algorithm [Cormen et al. 2001].

Bellman-Ford(V, E, w, s)
 1  foreach v ∈ V
 2      do d(v) = ∞
 3         π(v) = NIL
 4  d(s) = 0
 5  for k = 1 to N − 1
 6      do foreach edge (u, v) ∈ E
 7          do ▷ Relax (u, v)
 8             if d(v) > d(u) + W(u, v)
 9                 then d(v) = d(u) + W(u, v)
10                      π(v) = u
11  foreach edge (u, v) ∈ E
12      do if d(v) > d(u) + W(u, v)
13          then return "A negative-weight cycle exists."

Our algebraic formulation of Bellman-Ford more closely resembles a dynamic
programming interpretation. We define

\[
\Delta_k(u,v) =
\begin{cases}
\min\{w(p) : p \text{ is a path } u \leadsto v,\ |p| \leq k\} & \text{if a } (\leq k)\text{-hop path from } u \text{ to } v \text{ exists} \\
\infty & \text{otherwise}
\end{cases}
\]

to be the shortest path distance from $u$ to $v$ using at most $k$ edges. If $\Delta_N(s,v) <
\Delta_{N-1}(s,v)$ for any $v$, then a negative-weight cycle exists. Otherwise, $\Delta_{N-1}(s,v) =
\Delta(s,v)$.

Computing $\Delta_k(s,v)$ from $\Delta_{k-1}(s,u)$ is logically identical to relaxing all edges
incident on $v$. In particular

\[
\Delta_k(s,v) = \min_{u} \{\Delta_{k-1}(s,u) + W(u,v)\} \tag{5.1}
\]

Note that $\Delta_k$ can be computed from only $\Delta_{k-1}$, so all other $\Delta_i$ can be discarded.



Algebraic Bellman-Ford 

To represent Bellman-Ford as matrix and vector operations, we use the (sparse)
$N \times N$ adjacency matrix $A$ to store the edge weights and a $1 \times N$ vector $d_k$ to store
shortest $(\leq k)$-hop path distances $\Delta_k(s, *)$. Here, the values along the diagonal of
the adjacency matrix are zeros.

Translating equation (5.1) to a vector matrix product is relatively straight-
forward. We have $d_k(v) = \min_{u \in V}(d_{k-1}(u) + A(u,v))$, which is just the product
$d_k(v) = d_{k-1}$ min.+ $A(:,v)$. Thus, we have $d_k = d_{k-1}$ min.+ $A$.

We can represent the shortest path distance as $d = d_0 A^{N-1}$, where each
multiplication is min.+, and

\[
d_0(v) =
\begin{cases}
0 & \text{if } v = s \\
\infty & \text{otherwise}
\end{cases}
\]

The expression $d = d_0 A^{N-1}$ yields two natural algorithms for computing single-
source shortest path distances. The first is the faithful algebraic representation of
Bellman-Ford given in Algorithm 5.2. This algorithm uses $O(N)$ dense vector,
sparse matrix multiplications, so the running time is $\Theta(NM)$ comparisons and
additions, matching the regular Bellman-Ford.



Algorithm 5.2. Algebraic Bellman-Ford.

An algebraic implementation of the Bellman-Ford algorithm.

Bellman-Ford(A, s)
    d = ∞
    d(s) = 0
    for k = 1 to N − 1
        do d = d min.+ A
    if d ≠ d min.+ A
        then return "A negative-weight cycle exists."
    return d

The second algorithm is to compute $A^{N-1}$ first by repeated squaring. Note
that this approach actually computes all-pairs shortest paths first, and the product
$d_0$ min.+ $(A^{N-1})$ just selects a single row from the matrix $A^{N-1}$. Because the diagonal
of $A$ is zero, any power of at least $N-1$ gives the same matrix when there are no
negative-weight cycles, so squaring past $N-1$ is harmless. This algorithm re-
quires $\log N$ dense matrix multiplies that require $\Theta(N^3 \log N)$ work. Note that from
the perspective of total work, the algebraic Bellman-Ford given in Algorithm 5.2
asymptotically outperforms the repeated squaring in the worst case, but repeated
squaring may allow for more parallelism.
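Both algorithms can be sketched with dense Python lists standing in for the vector and matrix. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the matrix is dense (so the vector-matrix product costs more than the sparse product the text analyzes), missing edges are stored as infinity, and the diagonal is zero.

```python
import math

def min_plus_vec_mat(d, A):
    """Row vector times matrix over the (min, +) semiring."""
    n = len(d)
    return [min(d[u] + A[u][v] for u in range(n)) for v in range(n)]

def min_plus_mat_mat(A, B):
    """Matrix times matrix over the (min, +) semiring."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def algebraic_bellman_ford(A, s):
    """Iterate d = d min.+ A for N - 1 steps, then check for a negative cycle."""
    n = len(A)
    d = [math.inf] * n
    d[s] = 0
    for _ in range(n - 1):
        d = min_plus_vec_mat(d, A)
    if d != min_plus_vec_mat(d, A):
        raise ValueError("A negative-weight cycle exists.")
    return d

def all_pairs_by_squaring(A):
    """Square A until the exponent is at least N - 1."""
    n = len(A)
    power, hops = A, 1
    while hops < n - 1:
        power = min_plus_mat_mat(power, power)
        hops *= 2
    return power
```

Row s of the squared matrix agrees with the iterative result whenever the graph has no negative-weight cycle. The squaring version performs about log2(N) matrix products, each of which exposes far more independent work than a single vector-matrix product.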

5.1.2 Computing the shortest path tree (for Bellman-Ford)

In addition to computing shortest path distances, Bellman-Ford and other shortest
path algorithms find shortest paths. These paths are typically stored using the
parent pointers $\pi$, comprising a shortest path tree. We now extend our algorithms
to compute the shortest path tree. This section describes shortest path trees in
the context of Bellman-Ford, but the same techniques can be trivially applied to
Floyd-Warshall, for example.

In the absence of 0-weight cycles, computing the shortest path tree is relatively
easy. For each $v \neq s$ with $\Delta(s, v) \neq \infty$, we simply set $\pi(v) = u$ such that $d(u) +
W(u, v) = d(v)$. Since $d(v) = d(u) + W(u, v)$ for some $u \neq v \in V$,
and $d(v) \leq d(u) + W(u,v)$ for all $u \in V$, we have $\pi(v) = \operatorname{argmin}_{u \neq v}\{d(u) +
W(u,v)\}$. We represent this expression algebraically as

\[
\pi = d \ \text{argmin.+}\ (A + \operatorname{diag}(\infty)) \tag{5.2}
\]

We add $\operatorname{diag}(\infty)$, the matrix having $\infty$'s along the diagonal, to eliminate self-loops
from consideration. Note that $\pi(s)$ has an incorrect value, so we must set $\pi(s) = \mathrm{NIL}$.
Similarly, for any $v$ with $\Delta(s,v) = \infty$, the value at $\pi(v)$ is incorrect, which must
also be remedied.*

In a graph with 0-weight cycles, using equation (5.2) may not yield a shortest
path tree (even after resolving complications induced by unreachable vertices). In
particular, the argmin may select vertices forming 0-weight cycles.
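Equation (5.2) and its failure mode on 0-weight cycles can be sketched in Python. The function name, the dense weight matrix, the 0-based numbering, and the choice to break ties toward the smallest vertex index are assumptions of this sketch; the cleanup for the source and for unreachable vertices is done inline.

```python
import math

def parents_by_argmin(d, A, s):
    """pi(v) = argmin over u != v of d(u) + A(u, v), given final distances d."""
    n = len(d)
    parent = []
    for v in range(n):
        if v == s or d[v] == math.inf:
            parent.append(None)          # cleanup cases noted in the text
            continue
        # Skipping u == v plays the role of adding diag(infinity) to A.
        parent.append(min((u for u in range(n) if u != v),
                          key=lambda u: d[u] + A[u][v]))
    return parent
```

On a graph without 0-weight cycles this returns a valid shortest path tree. If two vertices are joined by 0-weight edges in both directions and one of them ties with the true parent, the argmin can make the two vertices each other's parent, which is the failure the text describes.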

We have two approaches for computing the parent pointers on arbitrary graphs
(without negative-weight cycles). The first approach entails augmenting our opera-
tors to operate on 3-tuples rather than real-valued weights. This 3-tuple approach
operationally follows the standard Bellman-Ford algorithm by updating parent
pointers as the distances decrease, and it incurs only a constant-factor
overhead in performance.† The second approach uses ideas from equation (5.2),
pushing off the shortest-path-tree computation until the end.

Both approaches have their advantages. The 3-tuple approach uses a single
semiring for all operations and needs only to introduce tuples and change the op-
erations used from our algebraic Bellman-Ford. The other approach uses simpler
2-tuples, but we must operate over more than one semiring. Moreover, the second
approach involves some additional cleanup to fix some of the incorrect settings to
$\pi$ (as mentioned for equation (5.2)). We give both approaches here because either
one may be useful for different problems.

*Assuming every vertex is reachable from $s$, there is no vertex with $\Delta(s, v) = \infty$.
†Note that an implementation does not have to explicitly modify the input adjacency matrix.
Instead, we can modify the matrix-primitive implementations to treat three same-sized matrices
as a single matrix of 3-tuples.



Parent pointers are on smallest size paths 



In the original Bellman-Ford algorithm, parent pointers are updated only when
edges are relaxed and a strictly shorter (distance) path is discovered. In particular,
let $\Pi_k$ be the possible values assigned to parent pointer $\pi$ on the $k$th iteration of
Bellman-Ford.* Then

\[
\Pi_k(s,v) =
\begin{cases}
\{\mathrm{NIL}\} & \text{if } k = 0,\ v = s \\
\emptyset & \text{if } k = 0,\ v \neq s \\
\Pi_{k-1}(s,v) & \text{if } k \geq 1,\ \Delta_k(s,v) = \Delta_{k-1}(s,v) \\
\{u : \Delta_k(s,v) = \Delta_{k-1}(s,u) + W(u,v)\} & \text{otherwise}
\end{cases}
\]

Consideration of this update rule gives the following lemma, stating that the parent
selected is the parent on a shortest $(\leq k)$-hop path having the fewest hops.

Lemma 5.1. Consider a graph with designated source vertex $s$ and no negative-
weight cycles. Then $u \neq \mathrm{NIL} \in \Pi_k(s,v)$ if and only if there exists a path $p =
(s, \ldots, u, v)$ such that $w(p) = \Delta_k(s,v)$, and there is no path $p' : s \leadsto v$ such that
$w(p') = \Delta_k(s,v)$ and $|p'| < |p|$. Moreover, $\Pi_k(s,s) = \{\mathrm{NIL}\}$.

Proof. By induction on $k$. The lemma trivially holds at $k = 0$.

If $\Delta_k(s,v) = \Delta_{k-1}(s,v)$, then a shortest $(\leq k)$-hop path from $s$ to $v$ has size
at most $k - 1$, and $\Pi_k(s,v) = \Pi_{k-1}(s,v)$ satisfies the lemma. If $\Delta_k(s,v) <
\Delta_{k-1}(s,v)$, then all shortest $(\leq k)$-hop paths from $s$ to $v$ have size exactly $k$,
and the lemma holds given the definition of $\Pi_k$. Since $\Delta_k(s, s) = 0$, it follows that
$\Pi_k(s,s) = \{\mathrm{NIL}\}$. □

This lemma is useful for both of the following shortest path tree approaches.



Computing parent pointers with 3-tuples 

Since the original Bellman-Ford algorithm is correct when relaxing edges in an 
arbitrary order, Lemma 5.1 implies that we can update the parent pointer to be the 
penultimate vertex along any smallest (size) path, as the final edges on all equally 
sized paths are relaxed on the same iteration. Thus, as long as we consider path 
size and weight when performing an update, we do not need the strict inequality 
used in the parent-pointer update rule. 

The goal here is to link the distance and parent-pointer updates by changing
our operations to work over 3-tuples. These 3-tuples consist of components corre-
sponding to total path weight, path size, and the penultimate vertex. By cleverly
redesigning operators for these 3-tuples, we can use the same algebra-based algo-
rithms (i.e., $\delta = d_n = d_0 A^n$ still applies for a good choice of scalar addition and
multiplication operations).

*An "iteration" here refers to an iteration of the for loop in lines 5-10 of Algorithm 5.1.

We define our scalars as 3-tuples of the form $(w,h,\pi) \in S = (\mathbb{R}_\infty \times \mathbb{N} \times
V) \cup \{(\infty,\infty,\infty), (0,0,\mathrm{NIL})\}$, where $\mathbb{R}_\infty = \mathbb{R} \cup \{\infty\}$ and $\mathbb{N} = \{1,2,3,\ldots\}$. The
value $(\infty, \infty, \infty)$ corresponds to nonexistence of a path, and the value $(0, 0, \mathrm{NIL})$
corresponds to a path from a vertex to itself. The first tuple element $w \in \mathbb{R}_\infty$
corresponds to a path weight. The second tuple element $h \in \mathbb{N}$ corresponds to
a path size or number of hops. The third element $\pi \in V$ corresponds to the
penultimate vertex along a path.

Given this definition of 3-tuples, setting up the entries in the adjacency matrix
$A$ is straightforward. In particular, we have

\[
A(u,v) =
\begin{cases}
(0, 0, \mathrm{NIL}) & \text{if } u = v \\
(W(u,v), 1, u) & \text{if } u \neq v,\ (u,v) \in E \\
(\infty, \infty, \infty) & \text{if } (u,v) \notin E
\end{cases}
\tag{5.3}
\]

Note that all of the tuple values, except the parent $u$ when the edge exists, are im-
plicit just given an edge weight. Moreover, a natural implementation likely already
stores $u$, either implicitly as a dense matrix index, or explicitly as an (adjacency)
list in a sparse matrix. Thus, when implementing this algorithm, it is not strictly
necessary to increase the storage used for the adjacency matrix.

Setting up the initial distance vector $d_0$ is also straightforward. We have

\[
d_0(v) =
\begin{cases}
(0, 0, \mathrm{NIL}) & \text{if } v = s \\
(\infty, \infty, \infty) & \text{otherwise}
\end{cases}
\tag{5.4}
\]

For our addition operation, we use an operator lmin that is defined to be the
lexicographic minimum. To perform a lexicographic minimum lmin, compare the
first tuple part. If they are equal, compare the second part, and, if they are equal,
the third part.

\[
\operatorname{lmin}\{(w_1, h_1, \pi_1), (w_2, h_2, \pi_2)\} =
\begin{cases}
(w_1, h_1, \pi_1) & \text{if } w_1 < w_2, \text{ or} \\
 & \text{if } w_1 = w_2 \text{ and } h_1 < h_2, \text{ or} \\
 & \text{if } w_1 = w_2 \text{ and } h_1 = h_2 \text{ and } \pi_1 < \pi_2 \\
(w_2, h_2, \pi_2) & \text{otherwise}
\end{cases}
\]

Without loss of generality, the vertices are numbered $1, 2, \ldots, N$, and $\mathrm{NIL} < v < \infty$
for all $v \in V$.

For our multiplication operation, we define a new binary function called $+_{\mathrm{rhs}}$.
This function adds the first two parts of the tuple and retains the third part of the
tuple from the right-hand-side argument of the operator. We define this operator
as follows

\[
(w_1, h_1, \pi_1) +_{\mathrm{rhs}} (w_2, h_2, \pi_2) =
\begin{cases}
(w_1 + w_2, h_1 + h_2, \pi_2) & \text{if } \pi_1 \neq \infty,\ \pi_2 \neq \mathrm{NIL} \\
(w_1 + w_2, h_1 + h_2, \pi_1) & \text{otherwise}
\end{cases}
\]

We introduce the exceptions for $\pi_1 = \infty$ simply to give us a multiplicative identity
when this operation is used as multiplication in conjunction with lmin as addition.
The exception when $\pi_2 = \mathrm{NIL}$ is for correctness. Note that unlike regular +, this
$+_{\mathrm{rhs}}$ operation is not commutative.
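The two operators and the resulting Bellman-Ford iteration can be sketched in Python. This is an illustrative sketch under stated assumptions: NIL is encoded as 0 and the vertices as 1..N so that NIL < v < infinity holds with ordinary comparisons, Python's built-in tuple ordering stands in for lmin, and the function and constant names are hypothetical.

```python
import math

INF = math.inf
NIL = 0                     # vertices are 1..N, so NIL < v < INF
NO_PATH = (INF, INF, INF)   # additive identity
EMPTY_PATH = (0, 0, NIL)    # multiplicative identity

def lmin(t1, t2):
    """Lexicographic minimum; Python compares tuples lexicographically."""
    return min(t1, t2)

def plus_rhs(t1, t2):
    """Add weights and hops; keep the right-hand parent unless an exception applies."""
    (w1, h1, p1), (w2, h2, p2) = t1, t2
    parent = p2 if (p1 != INF and p2 != NIL) else p1
    return (w1 + w2, h1 + h2, parent)

def bellman_ford_3tuple(n, weights, s):
    """weights: dict (u, v) -> weight with u != v, vertices 1..n. Returns d_{n-1}."""
    def entry(u, v):                                   # equation (5.3)
        if u == v:
            return EMPTY_PATH
        if (u, v) in weights:
            return (weights[(u, v)], 1, u)
        return NO_PATH
    d = {v: EMPTY_PATH if v == s else NO_PATH for v in range(1, n + 1)}  # (5.4)
    for _ in range(n - 1):
        new_d = {}
        for v in range(1, n + 1):
            acc = NO_PATH
            for u in range(1, n + 1):
                acc = lmin(acc, plus_rhs(d[u], entry(u, v)))
            new_d[v] = acc
        d = new_d
    return d
```

On a graph with the edge 3 -> 2 of weight 1 and 0-weight edges between 1 and 2 in both directions, the hop count breaks the tie that a plain argmin cannot: vertex 2 keeps parent 3 (one hop) rather than switching to vertex 1 (three hops at the same weight).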

Lemma 5.2. The set $S = (\mathbb{R}_\infty \times \mathbb{N} \times V) \cup \{(\infty, \infty, \infty), (0, 0, \mathrm{NIL})\}$ under the
operations lmin.$+_{\mathrm{rhs}}$ is a semiring.

Proof. The lmin function is obviously commutative and associative as it is a natural
extension of min. Moreover, $S$ is closed under lmin and $+_{\mathrm{rhs}}$.

The additive identity is $(\infty, \infty, \infty)$, and the multiplicative identity is $(0, 0, \mathrm{NIL})$.
Note that having the exception for $\infty$ in $+_{\mathrm{rhs}}$ is necessary to make the additive
identity $(\infty, \infty, \infty)$ multiplied by anything be the additive identity. Having the
exception for NIL makes $(0, 0, \mathrm{NIL})$ act as a multiplicative identity.

To show associativity of $+_{\mathrm{rhs}}$, consider the multiplication

\[
(w_1, h_1, \pi_1) +_{\mathrm{rhs}} (w_2, h_2, \pi_2) +_{\mathrm{rhs}} (w_3, h_3, \pi_3)
\]

with any parenthesization. The result is $(w_1 + w_2 + w_3, h_1 + h_2 + h_3, \pi')$. If any
$\pi_i = \infty$, then $\pi' = \infty$. Suppose all $\pi_i \neq \infty$. Then the result is $\pi' = \pi_i$, where $i$ is
the largest value such that $\pi_i \neq \mathrm{NIL}$.

Finally, we show that $+_{\mathrm{rhs}}$ (implicitly denoted by $ab = a +_{\mathrm{rhs}} b$) distributes over
lmin. In particular, let $t = (w, h, \pi)$, $t_1 = (w_1, h_1, \pi_1)$, and $t_2 = (w_2, h_2, \pi_2)$. Then
we must show left distributivity $t(t_1 \operatorname{lmin} t_2) = t t_1 \operatorname{lmin} t t_2$ and right distributivity
$(t_1 \operatorname{lmin} t_2)t = t_1 t \operatorname{lmin} t_2 t$. Without loss of generality (by commutativity of lmin), suppose that
$\operatorname{lmin}\{t_1, t_2\} = t_1$. We consider all cases separately.

1. Suppose $t = (\infty, \infty, \infty)$. Then $(\infty, \infty, \infty)(t_1 \operatorname{lmin} t_2) = (\infty, \infty, \infty)t_1 = (\infty, \infty, \infty)
= (\infty, \infty, \infty)t_1 \operatorname{lmin} (\infty, \infty, \infty)t_2$. Similarly for right distributivity.

2. Suppose $t = (0, 0, \mathrm{NIL})$. Then $(0, 0, \mathrm{NIL})(t_1 \operatorname{lmin} t_2) = (0, 0, \mathrm{NIL})t_1 = t_1 = t_1 \operatorname{lmin} t_2 =
(0, 0, \mathrm{NIL})t_1 \operatorname{lmin} (0, 0, \mathrm{NIL})t_2$. Similarly for right distributivity.

3. Suppose $t_2 = (\infty, \infty, \infty)$. Then $t(t_1 \operatorname{lmin} (\infty, \infty, \infty)) = t t_1 = t t_1 \operatorname{lmin} (\infty, \infty, \infty) =
t t_1 \operatorname{lmin} t(\infty, \infty, \infty)$. Similarly for right distributivity.

4. Suppose $t_1 = t_2 = (0, 0, \mathrm{NIL})$. Trivial.

5. Suppose $t_1 = (0, 0, \mathrm{NIL})$ and $t, t_2 \in \mathbb{R} \times \mathbb{N} \times V$. Since we suppose $\operatorname{lmin}\{t_1, t_2\} =
t_1$, we implicitly have $w_2 \geq 0$. Thus, $(w, h, \pi)((0, 0, \mathrm{NIL}) \operatorname{lmin} (w_2, h_2, \pi_2)) =
(w, h, \pi)(0, 0, \mathrm{NIL}) = (w, h, \pi)(0, 0, \mathrm{NIL}) \operatorname{lmin} (w, h, \pi)(w_2, h_2, \pi_2)$. The last step
follows because $w + w_2 \geq w$ and $h + h_2 \geq h + 1$. Similarly for right distributivity.

6. Suppose $t, t_1 \in \mathbb{R} \times \mathbb{N} \times V$ and $t_2 = (0, 0, \mathrm{NIL})$. Again, we implicitly have
$w_1 < 0$. Then $(w, h, \pi)((w_1, h_1, \pi_1) \operatorname{lmin} (0, 0, \mathrm{NIL})) = (w, h, \pi)(w_1, h_1, \pi_1) =
(w, h, \pi)(w_1, h_1, \pi_1) \operatorname{lmin} (w, h, \pi)(0, 0, \mathrm{NIL})$. The last step follows because $w +
w_1 < w$. Similarly for right distributivity.

7. Suppose $t, t_1, t_2 \in \mathbb{R} \times \mathbb{N} \times V$. Then we have $(w, h, \pi)((w_1, h_1, \pi_1) \operatorname{lmin} (w_2, h_2, \pi_2)) =
(w, h, \pi)(w_1, h_1, \pi_1) = (w, h, \pi)(w_1, h_1, \pi_1) \operatorname{lmin} (w, h, \pi)(w_2, h_2, \pi_2)$. Similarly for
right distributivity. □
Finally, we show correctness of our 3-tuple-based algorithm. Recall that the
goal of this 3-tuple transformation is to retain that $\delta \approx d_n = d_0 A^n$, where addition
and multiplication are defined as lmin and $+_{\mathrm{rhs}}$, respectively. More accurately, we
have $d_n(v) = (\Delta(s, v), |p|, u)$, where $p = (s, \ldots, u, v)$ is a smallest (size) path from
$s$ to $v$. Correctness is captured by the following lemma.

Lemma 5.3. Consider a graph with designated vertex $s$. Let adjacency matrix
$A$ and initial distance vector $d_0$ be set up according to equations (5.3) and (5.4),
and let $d_k = d_{k-1}$ lmin.$+_{\mathrm{rhs}}$ $A$ for $k \geq 1$. If $h \leq k \in \mathbb{N}$ is the minimum size
(number of hops) of a shortest $(\leq k)$-hop path from $s$ to $v$, then there exists $u \in V$
such that $d_k(v) = (\Delta_k(s, v), h, u)$, $u = \min\{u' : u' \in \Pi_k(s,v)\}$, and there exists an
$h$-hop path $p = (s, \ldots, u, v)$ with $w(p) = \Delta_k(s, v)$. If no $(\leq k)$-hop path exists, then
$d_k(v) = (\infty, \infty, \infty)$. For the source vertex, $d_k(s) = (0, 0, \mathrm{NIL})$.

Proof. By induction on $k$. The lemma trivially holds with $k = 0$.

Let $(w, h, u) = d_k(v)$ and suppose that a $(\leq k)$-hop path (for $k > 0$) from $s$
to $v$ exists.

First, we show that $w$ and $h$ are correct. For every $u_i \neq v \in V$, let $(w_i, h_i, \pi_i) =
d_{k-1}(u_i)$. Let $(w', h', \pi') = d_{k-1}(v)$. By definition, $d_k(v) = \operatorname{lmin}_{u' \in V} d_{k-1}(u') +_{\mathrm{rhs}}
A(u', v) = (\operatorname{lmin}_{u' : (u',v) \in E} d_{k-1}(u') +_{\mathrm{rhs}} A(u', v)) \operatorname{lmin} d_{k-1}(v)$. Thus, by nature
of the lmin, we have $(w, h) = \operatorname{lmin}\{(w', h'), \operatorname{lmin}_i\{(w_i + W(u_i, v), h_i + 1)\}\}$. The
inductive assumption gives $w_i = \Delta_{k-1}(s, u_i)$; we conclude that $w = \Delta_k(s, v)$ since
the shortest path using $\leq k$ hops must start with a shortest path using $\leq k - 1$
hops. Similarly, $h$ is correct since each $h_i$ corresponds to the smallest number of
hops necessary to achieve the shortest $(\leq k - 1)$-hop path.

Next, we show that $u$ is correct. Notice that the lmin across all neighbors
$u_i$ of $v$ can be reduced to the lmin across all neighbors having $w_i + W(u_i, v) =
w = \Delta_k(s, v)$ and $h_i + 1 = h$, or equivalently (from Lemma 5.1) the $u_i \in \Pi_k(s, v)$.
The lmin is also affected by $(w', h', \pi') = d_{k-1}(v)$ if $w' = w$ and $h' = h$. For any
neighbor $u_i \in \Pi_k(s, v)$, we have $d_{k-1}(u_i) +_{\mathrm{rhs}} A(u_i, v) = (w, h, u_i)$. For $v$ itself,
$d_{k-1}(v) +_{\mathrm{rhs}} A(v, v) = d_{k-1}(v)$. By inductive assumption, $\pi' = \min\{u' : u' \in
\Pi_{k-1}(s, v)\}$. Thus, since $w' = w$ implies $\Pi_k(s, v) = \Pi_{k-1}(s, v)$, we have that the
result value $u$ is affected by all and only the members of $\Pi_k(s, v)$. In other words,
we have reduced our expression to $d_k(v) = \operatorname{lmin}_{\pi \in \Pi_k(s,v)} (w, h, \pi)$. □

Lemmas 5.2 and 5.3 together imply that $d = d_0 A^n$ gives a result vector $d$
that is consistent with the Bellman-Ford algorithm, regardless of the ordering of
multiplications. As a result, either the iterative or repeated squaring algorithm still
works.

Computing parent pointers at the end 

In this approach, we use the one-extra-multiplication idea of equation (5.2), but we 
modify the scalars to make the equation work for 0-weight cycles. This approach 
still uses tuples, but they are slightly simpler. 

We use 2-tuples of the form $(w, h) \in \mathbb{R}_\infty \times (\mathbb{N} \cup \{0, \infty\})$. These tuple values have
the same meaning as in the 3-tuple approach, but now there is no mention of
parent pointers in the tuple. Setting up $A$ and $d_0$ should be obvious given this new
definition.

The Bellman-Ford algorithm using 2-tuples remains the same as that using
3-tuples, except now we use lmin for addition and the simpler + for multiplication.
In other words, we have $d_k = d_{k-1}$ lmin.+ $A$. Correctness follows from Lemma 5.3
since these operations are exactly the same as those used by the 3-tuple approach,
except that the third tuple element is ignored. Note that + and lmin are virtually
identical to + and min, so we now have a commutative semiring.

To compute the parent pointers, set $\pi = d_n$ argmin.+ $(A + \operatorname{diag}((\infty, \infty)))$.
In the following lemma, we claim that the values in $\pi(v)$ are correct for all $v \neq s$
that are reachable from $s$.

Lemma 5.4. Consider a graph with designated vertex $s$ and no negative-weight
cycles. Let $d_n(v) = (\Delta(s, v), h_v)$, where $h_v$ is the smallest size of a shortest path
from $s$ to $v$. Let $\pi = d_n \mathbin{\text{argmin}.+} (A + \mathrm{diag}((\infty, \infty)))$. Then for all $v \in V - \{s\}$
with $\Delta(s, v) < \infty$, we have $\pi(v) \in \Pi_n(s, v)$.

Proof. Consider any vertex $v \in V - \{s\}$ that is reachable from $s$. The product
$\pi = d_n \mathbin{\text{argmin}.+} (A + \mathrm{diag}((\infty, \infty)))$ gives
$\pi(v) = \text{argmin}_{u \neq v} \{(\Delta(s, u) + W(u, v),\ h_u + 1)\}$.
By Lemma 5.1, we have $u \in \Pi_n(s, v)$ if and only if $\Delta(s, u) + W(u, v) = \Delta(s, v)$
and $h_u = h_v - 1$. Thus, $\pi(v) = \text{argmin}_{u \in \Pi_n(s, v)} \{(\Delta(s, v), h_v)\}$, giving us the result
that $\pi(v) \in \Pi_n(s, v)$. □

For any vertex that is not reachable from $s$, or for $s$ itself, this approach does
not yield a correct answer, and we must do some additional cleanup to get $\pi$ to
contain correct values. Alternatively, we can modify argmin to take on a value of
$\infty$ if the operands are all $(\infty, \infty)$, and then the only cleanup necessary is setting
$\pi(s) = \text{NIL}$.
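The 2-tuple approach above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the graph is a dense list-of-lists weight matrix `W` with `math.inf` for missing edges, the function names `bellman_ford_tuples` and `parents_at_end` are hypothetical, and Python's built-in lexicographic tuple comparison stands in for lmin. Unreachable operands are skipped, which plays the role of the modified argmin that returns no parent when every operand is $(\infty, \infty)$.

```python
import math

INF = math.inf

def bellman_ford_tuples(W, s):
    """Return d with d[v] = (distance, hops) using (lmin, +) on 2-tuples.

    W[u][v] is the edge weight, or INF if there is no edge (assumed layout).
    """
    n = len(W)
    d = [(INF, INF)] * n
    d[s] = (0, 0)
    for _ in range(n - 1):
        new = list(d)  # the diagonal entry (0, 0) of A keeps the old value
        for v in range(n):
            for u in range(n):
                if u == v or d[u][0] == INF or W[u][v] == INF:
                    continue
                cand = (d[u][0] + W[u][v], d[u][1] + 1)
                if cand < new[v]:  # lexicographic comparison acts as lmin
                    new[v] = cand
        d = new
    return d

def parents_at_end(W, d, s):
    """One extra argmin.+ pass over the final tuples to pick parent pointers."""
    n = len(W)
    pi = [None] * n
    for v in range(n):
        if v == s:
            continue  # cleanup step: the source keeps NIL
        best = (INF, INF)
        for u in range(n):
            # u == v is excluded, mirroring the diag((inf, inf)) term
            if u == v or d[u][0] == INF or W[u][v] == INF:
                continue
            cand = (d[u][0] + W[u][v], d[u][1] + 1)
            if cand < best:
                best, pi[v] = cand, u
    return pi
```

The hop count in the second tuple element is what keeps a 0-weight cycle from being selected as a parent chain: among candidates with equal path weight, the one with fewer hops wins.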



5.1.3 Floyd-Warshall

The Floyd-Warshall algorithm solves the all-pairs shortest paths problem. Given a
graph $G = (V, E)$ with edge weights $w$ and no negative-weight cycles, this algorithm
returns the shortest path distances $\Delta(u, v)$ for all $u, v \in V$.

Like Bellman-Ford, Floyd-Warshall is a dynamic programming solution. Without
loss of generality, vertices are numbered $V = \{1, 2, \ldots, N\}$. This algorithm uses
$D_k(u, v)$ to represent the shortest path from $u$ to $v$ using only intermediate vertices
in $\{1, 2, \ldots, k\} \subseteq V$. Thus, we have

$$
D_k(u, v) =
\begin{cases}
W(u, v) & \text{if } k = 0, \\
\min\{D_{k-1}(u, v),\ D_{k-1}(u, k) + D_{k-1}(k, v)\} & \text{if } k \geq 1.
\end{cases}
$$

Algorithm 5.3 gives pseudocode for the Floyd-Warshall algorithm. The running
time is $\Theta(N^3)$ because of the triply nested loops. Although we explicitly store
every $D_k$ in this version of the algorithm, it is only necessary to store $D_k$ and $D_{k-1}$
at any given time. Thus, the space usage is $\Theta(N^2)$.

Algorithm 5.3. Floyd-Warshall.

A standard implementation of the Floyd-Warshall algorithm [Cormen et al. 2001].
Floyd-Warshall(V, E, w)
 1  for u = 1 to N
 2      do for v = 1 to N
 3          do D_0(u, v) = W(u, v)
 4             if u = v
 5                 then Π_0(u, v) = NIL
 6             if u ≠ v and W(u, v) < ∞
 7                 then Π_0(u, v) = u
                   ▷ else Π_0(u, v) is undefined
 8  for k = 1 to N
 9      do for u = 1 to N
10          do for v = 1 to N
11              do if D_{k-1}(u, k) + D_{k-1}(k, v) < D_{k-1}(u, v)
12                  then D_k(u, v) = D_{k-1}(u, k) + D_{k-1}(k, v)
13                       Π_k(u, v) = Π_{k-1}(k, v)
14                  else D_k(u, v) = D_{k-1}(u, v)
15                       Π_k(u, v) = Π_{k-1}(u, v)



The table entries $\Pi_k(u, v)$ in this algorithm refer to the penultimate vertex on
a shortest path from $u$ to $v$ using only vertices $\{1, 2, \ldots, k\}$. This value should not be
confused with $\Pi_k$ given in Section 5.1.2, which is the set of penultimate vertices on
every shortest ($\leq k$)-hop path from $u$ to $v$. Note that if we define $\pi_k(v) = \Pi_k(s, v)$,
then $\pi_k$ comprises a shortest path tree from the source $s$. The entire set of edges
selected by $\Pi_k$ does not necessarily form a tree.

Algebraic Floyd-Warshall 

To represent Floyd-Warshall as matrix and vector operations, we use the (dense)
$N \times N$ shortest path matrices $D_k$ to store the path weights. We initialize $D_0 = A$,
or $D_0(u, v) = W(u, v)$, with $D_0(u, u) = 0$.

The recursive definition of $D_k(u, v)$ simply considers two possibilities: either
the optimal path from $u$ to $v$ using intermediate vertices $\{1, 2, \ldots, k\}$ uses vertex
$k$, or it does not. For our algebraic Floyd-Warshall, we compute the weight of the
shortest paths using vertex $k$ for all $u, v$ at once. In particular, the $k$th column of $D_k$,
denoted $D_k(:, k)$, contains the shortest paths from any $u$ to $k$. Similarly, the $k$th row
$D_k(k, :)$ contains the shortest paths from $k$ to any $v$. Thus, we simply need to add
every pair of these values to get the shortest path weights going through $k$, which
we can do by taking $D_k(:, k) \mathbin{\min.+} D_k(k, :)$. Note that strictly speaking, there is
no scalar addition occurring in this outer product, so we do not have to use min.

Thus the full value for $D_k$ is

$$D_k = D_{k-1} \mathbin{.\!\min} \bigl(D_{k-1}(:, k) \mathbin{\min.+} D_{k-1}(k, :)\bigr).$$

This expression yields the algebraic Floyd-Warshall algorithm given in Algorithm 5.4.
This algorithm performs $N$ dense vector multiplications, giving complexity $\Theta(N^3)$.

Algorithm 5.4. Algebraic Floyd Warshall. 

An algebraic implementation of the Floyd-Warshall algorithm. 

Floyd-Warshall(A)
 1  D = A
 2  for k = 1 to N
 3      do D = D .min [D(:, k) min.+ D(k, :)]

To compute parent pointers, we can use the same tuples as in Bellman-Ford. 
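As a rough illustration of Algorithm 5.4, the sketch below performs the outer-product update with plain Python lists. The dense list-of-lists layout, the use of `math.inf` for missing edges, and the function name `floyd_warshall_algebraic` are assumptions made for this example; a real implementation would use a matrix library.

```python
import math

INF = math.inf

def floyd_warshall_algebraic(A):
    """All-pairs shortest path weights via N (min, +) outer products.

    A[u][v] is the edge weight, INF if absent, and 0 on the diagonal.
    """
    n = len(A)
    D = [row[:] for row in A]
    for k in range(n):
        col = [D[u][k] for u in range(n)]  # D(:, k)
        row = D[k][:]                      # D(k, :)
        # Outer product under min.+ followed by an elementwise min with D.
        D = [[min(D[u][v], col[u] + row[v]) for v in range(n)]
             for u in range(n)]
    return D
```

Each iteration touches every entry once, so the loop performs $N$ passes of $\Theta(N^2)$ work, matching the $\Theta(N^3)$ bound stated above.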

5.2 Minimum spanning tree 

In the minimum-spanning-tree problem, we are given a weighted, undirected graph
$G = (V, E)$ with edge weights $w : E \to \mathbb{R}_\infty$, where $W(u, v) = \infty$ if $(u, v) \notin E$. For
simplicity, $W(v, v) = 0$ for all $v \in V$.

A spanning tree is a subset $T \subseteq E$ of edges forming a "tree" that connects the
entire graph. A tree is a set of edges forming an acyclic graph. To be a spanning
tree, it must be that $|T| = N - 1$. The weight of a spanning tree is the total weight
$w(T) = \sum_{e \in T} W(e)$. A spanning tree is a minimum spanning tree if for all other
spanning trees $T'$, we have $w(T) \leq w(T')$.

Minimum spanning trees are unique if all edge weights in the graph are unique, 
but otherwise there is no guarantee of uniqueness. Without loss of generality, we 
assume that all edge weights are unique (since they can be made to be unique by 
creating tuples including indices, or equivalently by appending unique lower-order 
bits to the ends of the weights). 

This section considers the minimum-spanning-tree problem. A similar problem 
that is not any "harder" is the problem of finding a minimum spanning forest in an 
unconnected graph, which can be solved directly or by finding minimum spanning 
trees in each connected component of the graph. For the purposes of exposition, 
we assume the graphs to be connected. 

We describe Prim's algorithm for solving the minimum-spanning-tree problem. 
The algebraic variant of Prim's algorithm does not achieve as efficient a bound as
the standard algorithm that it replaces. However, a variant of Borůvka's algorithm
(not presented here) does have matching complexity.

5.2.1 Prim's 

Prim's algorithm solves the minimum-spanning-tree problem by growing a single
set $S$ of vertices belonging to the spanning tree. On each iteration, we add the
"closest" vertex not in $S$ to $S$. Specifically, we say that an edge $(u, v)$ is a lightest
edge leaving $S$ if $u \in S$, $v \notin S$, and $W(u, v) = \min\{W(u', v') : u' \in S, v' \notin S\}$.
Suppose that the edge $(u, v)$ is a lightest edge leaving $S$, with $u \in S$. Then Prim's
algorithm updates $S = S \cup \{v\}$ and $T = T \cup \{(u, v)\}$. This process is repeated
$N - 1$ times, at which point $T$ is a spanning tree.

Prim's is typically implemented using a priority queue. A priority queue $Q$ is
a data structure that maintains a dynamic set of keyed objects and supports the
following operations:

• Insert(Q, x): inserts a keyed object $x$ into the queue (i.e., $Q = Q \cup \{x\}$).

• Q = Build-Queue($x_1, x_2, \ldots, x_n$): performs a batched insert of keyed objects
$x_1, x_2, \ldots, x_n$ into an empty queue $Q$ (i.e., $Q = \{x_1, x_2, \ldots, x_n\}$).

• Extract-Min(Q): removes and returns the element in $Q$ with the smallest
key (i.e., assuming unique keys, if $key(x) = \min_{y \in Q}\{key(y)\}$, then $Q = Q - \{x\}$).

• Decrease-Key(Q, x, k): decreases the value of $x$'s key to $k$, assuming that
$k < key(x)$. Here, $x$ is a pointer to the object.

Several other operations may be supported, but they are not important to Prim's
algorithm. A naive implementation of a priority queue is an unsorted list. The
Insert and Decrease-Key operations are trivially $\Theta(1)$. The Extract-Min
operation is $\Theta(N)$ in the worst case to scan the entire list. Fredman and Tarjan's
Fibonacci heap [Fredman & Tarjan 1987] has $O(1)$ amortized cost for all of these
operations except Extract-Min, which has $O(\log N)$ amortized cost.

A standard implementation of Prim's algorithm using a priority queue is given
in Algorithm 5.5. This algorithm performs one Build-Queue, $M$ Decrease-Key,
and $N$ Extract-Min operations, yielding a runtime of $O(M + N \log N)$ using a
Fibonacci heap.

Algorithm 5.5. Prim's. 

A standard implementation of Prim's algorithm [Cormen et al. 2001].

Prims(V, E, w)
 1  foreach v ∈ V
 2      do key(v) = ∞
 3         π(v) = NIL
 4  weight = 0
 5  ▷ choose some arbitrary vertex s ∈ V
 6  key(s) = 0
 7  Q = Build-Queue(V)
 8  while Q ≠ ∅
 9      do u = Extract-Min(Q)
10         weight = weight + W(π(u), u)
11         foreach v with (u, v) ∈ E
12             do if v ∈ Q and W(u, v) < key(v)
13                 then Decrease-Key(Q, v, W(u, v))
14                      π(v) = u
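A runnable approximation of Algorithm 5.5 is sketched below. Python's `heapq` module has no Decrease-Key operation, so this sketch swaps in a different technique: it pushes a fresh entry whenever a lighter edge is found and discards stale entries on extraction. That gives $O(M \log N)$ time rather than the Fibonacci-heap bound. The adjacency-dict input format and the name `prim_mst` are assumptions made for illustration.

```python
import heapq

def prim_mst(adj, s=0):
    """Return (total weight, parent map) of an MST of a connected graph.

    adj maps each vertex to a list of (neighbor, weight) pairs and must
    list every undirected edge in both directions (assumed layout).
    """
    in_tree = set()
    parent = {}
    total = 0
    heap = [(0, s, None)]  # entries are (key, vertex, tentative parent)
    while heap:
        key, u, p = heapq.heappop(heap)
        if u in in_tree:
            continue  # stale entry left behind in place of Decrease-Key
        in_tree.add(u)
        parent[u] = p
        total += key
        for v, w in adj[u]:
            if v not in in_tree:
                heapq.heappush(heap, (w, v, u))
    return total, parent
```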
Algebraic Prim's

We use the (sparse) $N \times N$ adjacency matrix $A$ to store edge weights, a $1 \times N$
vector $s$ to indicate membership in the set $S$, and a $1 \times N$ vector $d$ to store the
weights of the edges leaving the set $S$. We maintain $s$ and $d$ according to

$$
s(v) =
\begin{cases}
\infty & \text{if } v \in S, \\
0 & \text{otherwise}
\end{cases}
$$

and

$$d(v) = \min_{u \in S} W(u, v).$$

If $v \notin S$, then $d(v)$ gives the lightest edge connecting $v$ to $S$. If $v \in S$, then $d(v) = 0$.

Our algebraic version of Prim's is given in Algorithm 5.6. Line 7 finds the
vertex that is closest to $S$ (without being in $S$). This step performs an argmin
across an entire vector, which is essentially $\Theta(N)$ work per iteration. The vector
addition in line 10 decreases the key of all neighbors of the closest vertex $u$. This
step may cause $\Theta(N)$ work in a single iteration, but in total we only update each
edge once, for a total work of $O(M)$ across all iterations. The algorithm thus has
complexity $O(N^2)$, which is caused by the slow argmin of line 7. For comparison,
the conventional Prim's with a priority queue achieves $O(M + N \log N)$. The
algebraic bound is significantly worse for sparse graphs but equivalent for very
dense graphs.

Computing the tree. Thus far, we have ignored the spanning-tree edges, as we
do with shortest path trees. Keeping track of edges is relatively simple for Prim's
algorithm using a similar tuple-based approach. Each entry in the adjacency matrix
$A$ (and vector $d$) is a 2-tuple $A(u, v) = (W(u, v), u)$. The only necessary change to
the algorithm is to keep a $1 \times N$ vector $\pi$ that stores the spanning tree for all vertices
in $S$. In particular, we modify the algorithm according to Algorithm 5.7. We add
line 12 to store the edge at the time it is added to the spanning tree (because this
edge may be destroyed by the update to $d$ in line 13).

Algorithm 5.6. Algebraic Prim's. 

An algebraic implementation of Prim's algorithm. 

Prims(A)
 1  s = 0
 2  weight = 0
 3  ▷ choose arbitrary vertex ∈ V to start from
 4  s(1) = ∞
 5  d = A(1, :)
 6  while s ≠ ∞
 7      do u = argmin{s + d}
 8         s(u) = ∞
 9         weight = weight + d(u)
10         d = d .min A(u, :)
Algorithm 5.7. Algebraic Prim's with tree.

An algebraic implementation of Prim's algorithm that also computes the tree.

Prims(A)
 6  π = NIL
 7  while s ≠ ∞
 8      do u = argmin{s + d}
 9         s(u) = ∞
10         (w, p) = d(u)
11         weight = weight + w
12         π(u) = p
13         d = d .min A(u, :)



Since the entries in d and A correspond exactly to particular edges, it should 
be obvious that this version of the algorithm correctly returns the spanning tree 
edges as well as the total weight. Note that storing edges here is strikingly simpler 
than in shortest paths of Section 5.1.2. Part of the reason for the simplicity is 
that the d vector here stores just edges, whereas in Section 5.1.2, we store paths. 
Another cause is the structure of the algorithm — we are not operating over a single 
semiring here, so iterations of the algorithm do not associate anyway, so there is no 
need to come up with fancy logic. 
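Algorithm 5.7 can be approximated in a few lines of Python. The sketch below is illustrative only: it assumes a dense list-of-lists weight matrix with `math.inf` for missing edges and zeros on the diagonal, starts from vertex 0 instead of vertex 1, assumes a connected graph, and uses the hypothetical name `prims_algebraic`. Entries of `d` are (weight, parent) tuples, so the elementwise min is a lexicographic tuple comparison.

```python
import math

INF = math.inf

def prims_algebraic(A):
    """Return (total weight, parent vector) following Algorithm 5.7."""
    n = len(A)
    s = [0.0] * n          # s(v) = INF marks membership in S
    s[0] = INF
    d = [(A[0][v], 0) for v in range(n)]   # d = A(0, :) as (weight, parent)
    pi = [None] * n
    weight = 0
    while any(x != INF for x in s):
        # argmin{s + d}: vertices already in S are masked out by INF
        u = min(range(n), key=lambda v: s[v] + d[v][0])
        s[u] = INF
        w, p = d[u]
        weight += w
        pi[u] = p          # record the edge before d is overwritten
        # d = d .min A(u, :), elementwise on (weight, parent) tuples
        d = [min(d[v], (A[u][v], u)) for v in range(n)]
    return weight, pi
```

The argmin scans all $N$ entries on every pass, which is where the $O(N^2)$ bound discussed earlier comes from.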
Chapter 6

Complex Graph Algorithms 



Eric Robinson* 



Abstract 

This chapter discusses the representation of several complex graph algo- 
rithms as algebraic operations. Even though the underlying algorithms 
already exist, the algebraic representation allows for easily expressible 
efficient algorithms with appropriate matrix constructs. This chapter 
gives algorithms for clustering, vertex betweenness centrality, and edge 
betweenness centrality. 

6.1 Graph clustering 

Graph clustering is the problem of determining natural groups with high connectiv- 
ity in a graph. This can be useful in fields such as machine learning, data mining, 
pattern recognition, image analysis, and bioinformatics. There are numerous meth- 
ods for graph clustering, many of which involve performing random walks through 
the graph. Here, a peer pressure clustering technique [Reinhardt et al. 2006] is
examined. 

6.1.1 Peer pressure clustering 

Peer pressure clustering capitalizes on the fact that given any reasonable cluster
approximation for the graph, a vertex's cluster assignment will be the same as the
cluster assignment of the majority of its neighbors. Consider the graph in Figure 6.1,
where each vertex except vertex 4 is in the proper cluster.

If the incoming edges to each vertex in this graph are examined to determine 
from which cluster they originate, this clustering can be considered suboptimal. As 
seen in Figure 6.2, all vertices except vertex 4 have the most incoming edges from 
the cluster they are in. 

In addition, if the clustering is adjusted accordingly by placing vertex 4 in 
the red cluster, the result now shows the clustering to be optimal. Each vertex has 
a majority of incoming edges originating from its own cluster. This is shown in 
Figure 6.3. 

Traditionally, peer pressure clustering is performed by iteratively refining a 
cluster approximation for the graph. Given a cluster approximation, each vertex in 
the graph first votes for all of its neighbors to be in its current cluster. These votes 
are then tallied and a new cluster approximation is formed by moving vertices to 
the cluster for which they obtained the most votes. The algorithm typically ter- 
minates when a fixed point is reached (when two successive cluster approximations 
are identical). 

Algorithm 6.1 shows the recursive definition for peer pressure clustering. This 
algorithm can also be performed in a loop, keeping track of the current and previous 
cluster approximation and terminating when they agree. 

Algorithm 6.1. Peer pressure. Recursive algorithm for clustering vertices.* 

PeerPressure(G = (V, E), C_i)
 1  for (u, v, w) ∈ E
 2      do T(v)(C_i(u)) ← T(v)(C_i(u)) + w
 3  for n ∈ V
 4      do C_f(n) ← i : ∀j ∈ V : T(n)(j) < T(n)(i)
 5  if C_i == C_f
 6      then return C_f
 7      else return PeerPressure(G, C_f)

In this algorithm, the loop at line 1 is responsible for the voting, and the loop at 
line 3 tallies those votes to form a new cluster approximation. It is assumed that 
the structure T is stored as an array of lists, which keeps track of, for each vertex, 
the number of votes that vertex gets for each cluster for which it receives votes. 

Unfortunately, convergence is not guaranteed for unclustered graphs, and pathological
cases do exist where the algorithm must make $O(n)$ recursive calls before a
repetition. In any well-clustered graph, however, this algorithm will terminate after
a small number of recursive calls, typically on the order of five.
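The loop form of Algorithm 6.1 mentioned above might look like the following Python sketch. The edge-list input, the dict-based cluster map, the `max_passes` safety cap, and the function name `peer_pressure` are assumptions of this example. Ties are settled by taking the lowest-numbered cluster, and a vertex with no incoming votes simply keeps its cluster, which is a choice made here rather than something the pseudocode specifies.

```python
from collections import defaultdict

def peer_pressure(edges, clusters, max_passes=100):
    """Iterate voting and tallying until the cluster map stops changing.

    edges is a list of (u, v, w) triples; clusters maps vertex -> cluster id.
    """
    for _ in range(max_passes):
        tally = defaultdict(lambda: defaultdict(float))
        for u, v, w in edges:              # voting: u votes for v
            tally[v][clusters[u]] += w
        new = {}
        for n in clusters:                 # tallying
            votes = tally.get(n)
            if not votes:
                new[n] = clusters[n]       # no votes received: stay put
            else:
                best = max(votes.values())
                new[n] = min(c for c, t in votes.items() if t == best)
        if new == clusters:                # fixed point reached
            return new
        clusters = new
    return clusters
```

The cap on passes guards against the non-converging cases noted above; on well-clustered inputs the loop is expected to exit through the fixed-point test long before the cap.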



V.B. Shah, An Interactive System for Combinatorial Scientific Computing with an Emphasis 
on Programmer Productivity, Ph.D. thesis, Computer Science, University of California, Santa 
Barbara, June 2007. 
Figure 6.1. Sample graph with vertex 4 clustered improperly.




Figure 6.2. Sample graph with count of edges from each cluster 
to each vertex. 




Figure 6.3. Sample graph with correct clustering and edge counts. 
Starting approximation

In order to get things going with the peer pressure clustering algorithm, an initial
cluster approximation must be chosen. One strategy for this is to simply perform
a single round of Luby's maximal independent set algorithm [Luby 1986]. This
will yield a reasonable cluster approximation assuming that the individual clusters
in the graph are highly connected. However, this is completely unnecessary. For
graphs that actually contain clusters, the solution arrived at by the peer pressure
algorithm is largely independent of the initial cluster approximation. For this
reason, a naive starting approximation as shown below, where each vertex is in a
cluster by itself, suffices to start things off.

for v ∈ V
    do C_i(v) = v

Assuring vertices have equal votes 

In graphs where there is a large discrepancy between the out-degree of vertices, 
such as real-world power law graphs, vertices with a large out-degree will have a 
larger influence on the cluster approximations. These vertices will have more votes 
in each cluster refinement because they have more outgoing edges. This can be 
easily remedied by normalizing the votes of each vertex to one. This can be done 
by summing up the weights of the outgoing edges of each vertex, and then dividing 
those weights by that sum. The code for performing this is shown below.

for e = (u, v, w) ∈ E
    do S(u) = S(u) + w
for e = (u, v, w) ∈ E
    do w = w / S(u)

Preserving small clusters and unclustered vertices 

Depending upon the desired result, it may be advantageous to either group vertices 
with a single connection to a large cluster or to keep these vertices separate in clus- 
ters of their own. Typically, a one-vertex cluster makes little sense. However, one 
can view these as simply unclustered vertices. Smaller clusters and single vertices 
tend to get subsumed by larger ones because the vertices in larger clusters tend 
to have a very high combined out-degree. The majority of these edges go to other
vertices in the large cluster, but a few may go outward to smaller clusters. If a 
couple of vertices in a large cluster have these outward connections to the same 
small cluster, then that cluster may be pulled into the larger one. 

To remedy this, each vertex's votes can be scaled according to the size of the 
cluster it is in. It is typically not desirable to divide the vertex's vote directly by 
the size of its cluster, as that will lead to a cluster with a single vertex having the 
same voting strength as all the vertices combined in a larger cluster. However, some 
scaling is clearly desired. To do this, rather than scaling directly by the number of
vertices in the cluster, a strength $0 \leq p \leq 1$ is chosen, where 0 indicates no scaling
and 1 indicates full scaling. The inverse of the cluster size is then raised to this power
to determine the actual scaling. This can be done as shown below.

for u ∈ V
    do s(C(u)) = s(C(u)) + 1
for e = (u, v, w) ∈ E
    do w = w / s(C(u))^p

Figure 6.4. Sample graph.

The value for p need not be static. One option is to compute p based on the 
number of current clusters. When there are many clusters, p should be close to 0 as 
scaling is most likely not necessary. As the number of clusters shrinks, p can grow 
to prevent the larger of those clusters from dominating. 
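The two vote adjustments described in the preceding subsections (normalizing each vertex's outgoing votes, and scaling votes by cluster size raised to the power $p$) can be sketched as small Python helpers. The edge-list and dict formats and the function names `normalize_votes` and `scale_by_cluster_size` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def normalize_votes(edges):
    """Divide each edge weight by the total outgoing weight of its source."""
    out_sum = defaultdict(float)
    for u, _, w in edges:
        out_sum[u] += w
    return [(u, v, w / out_sum[u]) for u, v, w in edges]

def scale_by_cluster_size(edges, clusters, p):
    """Divide each vote by (size of the voter's cluster) ** p, 0 <= p <= 1."""
    size = Counter(clusters.values())
    return [(u, v, w / size[clusters[u]] ** p) for u, v, w in edges]
```

With `p = 0` the second helper leaves every weight unchanged, and with `p = 1` each vote is divided by the full cluster size, matching the two extremes described above.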

Settling ties 

Line 4 of Algorithm 6.1 does not specify what to do if two clusters tie for the 
maximum number of votes for a vertex. Currently, as the algorithm is shown, the 
vertex would not be put into a cluster. This, however, is not a desirable outcome. 
Instead, one of the clusters with the maximum number of votes for that vertex 
should be chosen. This can be done in a deterministic manner by selecting the 
cluster with the lowest number. This deterministic method also helps to speed the 
algorithm to convergence by having all vertices choose the same "leader" for their 
cluster early in the algorithm. 

Sample calculation 

A clustering example will be demonstrated on the graph shown in Figure 6.4, which 
is the same graph as that shown in Figure 6.3. The vertex numbers are shown here 
for the purposes of tie breakers (which will go to the vertex/cluster with the lowest 
number). The obvious clusters, as with the previous example, are between vertices 
1, 3, 5, 7 and vertices 2, 4, 6, 8. 

The initialization of the clustering algorithm requires computing weights for
each vertex's outgoing edges based on the overall out-degree of the vertex. In
addition, each vertex must be placed in a cluster by itself. Both of these steps are
shown in Figure 6.5. Note that it is assumed that there is a self-edge to each vertex,
and the initial cluster number for each vertex corresponds to its vertex number.

Consider a single pass through the peer pressure clustering algorithm for this 
graph. Vertices 1, 5, and 7 will be pulled into cluster 7, as its outgoing edges hold
the most value (there are the fewest of them). From there, all of the rest of the 
votes will be ties outside of those from vertex 6, which will lose because it has more 
outgoing edges than the rest of the vertices. As a result, vertices 3 and 4 are pulled 
into cluster 1 (which no longer contains vertex 1), vertices 2 and 8 are pulled into 
cluster 2, and finally vertex 6 is pulled into cluster 4 (which no longer contains 
vertex 4). These results are shown in Figure 6.6. 

After the first iteration through the peer pressure clustering algorithm, there 
are only four clusters remaining out of the initial eight. The second iteration reduces 
this number further. Vertices 1, 5, and 7 all have enough votes to remain in cluster 7. 
In addition, vertices 1 and 5 vote for vertex 3 to be in their cluster, resulting in a total
vote of 0.5, enough to pull vertex 3 into cluster 7, as the only other votes it receives
are from cluster 2 with a weight of 0.25. Vertices 2 and 8 also have enough votes to
remain in cluster 2. In addition, they both vote for vertex 4 to be in their cluster
for a total weight of 0.5, beating out the votes from cluster 4 of 0.2 and cluster 7 of
0.25, and shifting vertex 4 into cluster 2. Finally, vertex 6 receives 0.25 votes from
cluster 2, 0.25 votes from cluster 1, and 0.2 votes from cluster 4. The tie moves
vertex 6 to the cluster with the lowest number, cluster 1, as shown in Figure 6.7.

One final iteration is required to arrive at the result and reduce the total 
number of clusters from three, as in the previous round, to two. In this iteration, 
vertices 2, 4, and 8 have enough votes from their cluster to remain in cluster 2. 
Vertices 1, 3, 5, and 7 do as well, and they remain in cluster 7. Vertex 6 receives 0.5 
votes from cluster 2 and 0.2 votes from cluster 1 (itself), shifting it to cluster 2. This
iteration is shown in Figure 6.8 and is the final clustered solution for this graph. 
Note that the next pass will keep the same cluster approximation, indicating that 
a fixed point has been reached. 

Space complexity 

The above algorithm has a space complexity of $O(M)$ due to the cluster tally $T$.
$T$ can be assigned at most one entry per edge in the graph because each edge
represents a vote. Since it is possible that all of these edges represent votes for
different vertex-cluster pairs, $T$ may contain up to $M$ different vote combinations.

Time complexity 

The voting loop at line 1 must process $O(M)$ votes, one for each edge in the graph.
After this, the maximum for each vertex must be obtained by the tallying loop at
line 3. Because $T$ has a maximum of $O(M)$ entries, one for each vote made, this
requires $O(M)$ operations as well. This leads to $O(M)$ operations per recursive
call, and, where $p$ is the number of passes before convergence, a total runtime of
$O(p \times M)$. As noted earlier, $p$ is typically a small constant for well-clustered graphs,
on the order of five. In this case, the total runtime is just $O(M)$.
Figure 6.5. Initial clustering and weights.

Figure 6.6. Clustering after first iteration.

Figure 6.7. Clustering after second iteration.

Figure 6.8. Final clustering.

6.1.2 Matrix formulation

Where the graph is represented as a weighted adjacency matrix, $G = A : \mathbb{R}^{N \times N}$,
the clustering algorithm can be performed in the same manner. Let $C : \mathbb{B}^{N \times N}$ be
the cluster approximation, where if $c_{ij} == 1$, then vertex $j$ is in cluster $i$.

With this representation, voting can be expressed as a simple matrix-matrix
multiplication

T = CA

Here $T$ represents a tallying matrix where if $t_{ij} == k$, then there are $k$ votes for
vertex $j$ to be in cluster $i$.

Once the votes have been performed, the new clusters need to be selected. 
This can be done with the following operations 

m = T max. 
C = m .== T 

Here the max vote for each vertex in each cluster is found, then the cluster approx- 
imation is set appropriately according to that value. 

From this approximation, the peer pressure algorithm can be formed (see
Algorithm 6.2). Line 2 performs the voting, and lines 3 and 4 tally those votes to
form a new cluster approximation.

Algorithm 6.2. Peer pressure matrix formulation. 

PeerPressure(G = A : R^{N×N}, C_i : B^{N×N})
 1  T : R^{N×N}, C_f : B^{N×N}, m : R^N
 2  T = C_i A
 3  m = T max.
 4  C_f = m .== T
 5  if C_i == C_f
 6      then return C_f
 7      else return PeerPressure(G, C_f)
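A small Python sketch of the matrix formulation follows. It is written with plain nested lists to stay self-contained; the helper `matmul`, the function name `peer_pressure_matrix`, the iterative (rather than recursive) structure, and the `max_passes` cap are all assumptions of this example. Ties are settled by taking the first maximum in each column, which is the lowest-numbered cluster. Every vertex is assumed to have a self-edge, so no column of `T` is entirely zero.

```python
def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def peer_pressure_matrix(A, C=None, max_passes=100):
    """A[u][v] is the weight of edge u -> v; C[i][j] == 1 puts vertex j in cluster i."""
    n = len(A)
    if C is None:
        # Starting approximation: each vertex in its own cluster (C = I).
        C = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(max_passes):
        T = matmul(C, A)                    # voting: T = C A
        Cf = [[0] * n for _ in range(n)]
        for j in range(n):
            col = [T[i][j] for i in range(n)]
            m = max(col)                    # column maximum (m = T max.)
            Cf[col.index(m)][j] = 1         # first maximum wins the tie
        if Cf == C:
            return Cf
        C = Cf
    return C
```

A dense product costs far more than $O(M)$; the complexity figures quoted in this section presume sparse storage of `A`, `T`, and `C`.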

Starting approximation 

As before, an initial approximation must be selected. If each vertex is in a cluster 
by itself, with the cluster number being equal to the vertex number, then Ci = I, 
or the identity matrix. 

Assuring vertices have equal votes 

Normalizing the out-degrees of the vertices in the graph corresponds to normalizing 
the rows of the adjacency matrix. This can be done easily as shown below 

w = A +.

A = 1/w .× A

Preserving small clusters and unclustered vertices

Applying weights to clusters requires computing the size of each cluster, which is a 
sum over the rows of the cluster approximation. It can be performed as follows 

w = C +.

A = (1/w .^ p) .× A

Settling ties 

Settling ties for votes in this clustering algorithm requires selecting the lowest num- 
bered cluster with the highest number of votes. In many linear algebra packages, 
this simply corresponds to a call to max, finding the location of the maximum values 
in each column. Typically, the location corresponds to the first maximum value in 
that column, or the smallest cluster number among those who tie for maximums. 

Space complexity 

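Algorithm 6.2 uses the following variables:

Name    Type                          Number of Elements
A       $\mathbb{R}^{N \times N}$     $O(M)$
T       $\mathbb{R}^{N \times N}$     $O(M)$
C_i     $\mathbb{B}^{N \times N}$     $O(N)$
C_f     $\mathbb{B}^{N \times N}$     $O(N)$
m       $\mathbb{R}^{N}$              $O(N)$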



$T$ requires only $M$ entries because each edge casts a vote. There cannot be more
votes than there are edges in the graph. Given this, the overall space complexity of
the matrix version of peer pressure clustering is $O(M)$.

Time complexity 

Each edge casts a single vote in this algorithm, so the voting requires only
$O(M)$ operations. Tallying those votes also requires only $O(M)$ operations. Where
the algorithm requires $p$ passes to complete, the overall time is $O(p \times M)$. As with
the vertex/edge representation, assuming $p$ is a small constant, the overall time is
just $O(M)$.

6.1.3 Other approaches 

Here the Markov clustering algorithm is examined and compared to the peer pres- 
sure clustering algorithm presented above. 









Comparison to Markov clustering 

Another popular clustering algorithm using an adjacency matrix representation 
known as Markov clustering is presented here in Algorithm 6.3. 

Algorithm 6.3. Markov clustering. Recursive algorithm for clustering vertices.*

Markov(C_i : R^{N×N}, e, i)
 1  C_f = C_i^e
 2  C_f = C_f .^ i
 3  w = C_f +.
 4  C_f = 1/w .× C_f
 5  if C_i == C_f
 6      then return C_f
 7      else return Markov(C_f, e, i)

Here, Ci is initialized to be A. The parameters e and i are the expansion and 
inflation parameters, respectively. By tuning these parameters, the algorithm can 
theoretically be made to discover clusters of any coarseness unlike the peer pres- 
sure clustering algorithm, which only discovers fine-grained clusters. However, the 
performance differences between these two algorithms are substantial. 

The tallying phase in peer pressure clustering flattens the cluster approxima- 
tion, reducing the size of subsequent computations. For Markov clustering, there 
is nothing like this. Therefore, C can become very dense during the computation. 
The total space required by Markov clustering is $O(N^2)$. The total time required by
the algorithm, assuming it takes a small constant number of passes $p$ to converge,
is $O(p \times N^3)$. Both of these are larger than the requirements for peer pressure
clustering. In addition, while Markov clustering is guaranteed to converge, it typically
does so at a much slower rate than peer pressure clustering. Markov clustering can 
require up to p = 20 or p = 30 passes before converging to a solution. 
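One pass of Algorithm 6.3 (expansion, inflation, then row normalization) can be sketched in Python as below. This is a rough illustration under several assumptions: matrices are dense nested lists, the expansion parameter `e` is a positive integer, rows are normalized to match the row-sum convention used earlier in this section, and the exact equality test of line 5 is replaced by a small tolerance because floating-point iterates rarely match exactly. The names `markov_step` and `markov` are hypothetical.

```python
def matmul(X, Y):
    """Dense matrix product of two list-of-lists matrices."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def markov_step(C, e=2, i=2.0):
    """One pass: C^e (expansion), elementwise power i (inflation), normalize rows."""
    Cf = C
    for _ in range(e - 1):
        Cf = matmul(Cf, C)
    Cf = [[x ** i for x in row] for row in Cf]
    out = []
    for row in Cf:
        w = sum(row)
        out.append([x / w for x in row] if w else row[:])
    return out

def markov(C, e=2, i=2.0, tol=1e-9, max_passes=100):
    """Repeat markov_step until successive iterates agree to within tol."""
    for _ in range(max_passes):
        Cf = markov_step(C, e, i)
        if all(abs(a - b) <= tol
               for ra, rb in zip(C, Cf) for a, b in zip(ra, rb)):
            return Cf
        C = Cf
    return C
```

The repeated dense products in the expansion step are what drive the $O(N^3)$ per-pass cost, in contrast to the single sparse product used by peer pressure clustering.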

6.2 Vertex betweenness centrality

Centrality metrics for vertices in general seek to generate information about the 
importance of vertices in a graph. Betweenness centrality rates this importance on 
the basis of the number of shortest paths a vertex falls on between all other pairs 
of vertices. It is commonly used in fault tolerance and key component analysis. 

6.2.1 History 

The original vertex betweenness centrality algorithm was straightforward. It ran an
all-pairs shortest paths search over the graph, keeping track of the distances between
all pairs along with the number of shortest paths between those pairs. After this,
the centrality for each vertex was computed by looking at all other pairs of vertices
and adding the appropriate value to the centrality if the vertex in question was
on a shortest path between the other two vertices. The computation time for this
method was dominated by the second step, which had to examine $O(N^2)$ pairs
for each of the $O(N)$ vertices, leading to a runtime of $O(N^3)$. In addition, the path
lengths between all pairs of vertices corresponded to a dense matrix and required
$O(N^2)$ storage.

*S. van Dongen, Graph Clustering by Flow Simulation, Ph.D. thesis, University of Utrecht,
May 2000.

In order to remedy this problem, a new algorithm [Brandes 2001] was developed
that utilized cascading updates to betweenness centrality and performed
single-source shortest paths searches rather than an all-pairs shortest path search.
This algorithm reduced the computation time to $O(NM)$ for unweighted graphs
and $O(NM + N^2 \log N)$ for weighted graphs. It also allowed the computation
space to remain only $O(N + M)$, making it especially preferred when looking at
sparsely connected graphs.

6.2.2 Brandes' algorithm 

Here the improved algorithm for vertex betweenness centrality in unweighted graphs 
(see [Brandes 2001]) is examined. This algorithm centers on performing a single-
source shortest paths search for every vertex in the graph. During this search, the 
number of shortest paths to all other vertices is recorded, along with the order in 
which the vertices are seen and the shortest path predecessors for every vertex. 
After the shortest paths search has been performed for an individual vertex, the 
betweenness centrality metrics for each vertex are updated in the reverse order from 
which they were discovered during the search. 

Traditional algorithm 

Algorithm 6.4 shows Brandes' algorithm for betweenness centrality using a 
vertex/edge set representation. 

The loop on line 10 performs the single-source shortest paths search (through a 
breadth-first search) from an individual vertex. This requires O(M) time. The loop 
on line 25 updates the centrality metrics. Through all the iterations, it performs at 
most one update for each predecessor on the shortest path for a vertex. There 
cannot be more of these updates than there are edges in the graph, so this operation 
requires at most O(M) time. Over all iterations, the algorithm requires O(NM) time. 

Also, the only additional storage requirement that can have a size larger than 
O(N) is P. As stated before, there cannot be more predecessors than there are 
edges in the graph, so P's size is limited to O(M), and the algorithm requires at 
most O(N + M) additional storage. 

Sample calculation 

Consider the graph shown in Figure 6.9. A simple shortest paths calculation, as 
well as a centrality update, for paths starting at vertex 1 is presented here. 

Algorithm 6.4. Betweenness centrality. 

BetweennessCentrality(G = (V, E)) 
 1  C_B[v] ← 0, ∀v ∈ V 
 2  for ∀s ∈ V 
 3  do 
 4      S ← empty stack 
 5      P[w] ← empty list, ∀w ∈ V 
 6      σ[t] ← 0, ∀t ∈ V, σ[s] ← 1 
 7      d[t] ← -1, ∀t ∈ V, d[s] ← 0 
 8      Q ← empty queue 
 9      enqueue(Q, s) 
10      while ¬empty(Q) 
11      do 
12          v ← dequeue(Q) 
13          push(S, v) 
14          for ∀w ∈ neighbors(v) 
15          do 
16              if d[w] < 0 
17              then 
18                  enqueue(Q, w) 
19                  d[w] ← d[v] + 1 
20              if d[w] = d[v] + 1 
21              then 
22                  σ[w] ← σ[w] + σ[v] 
23                  append(P[w], v) 
24      δ[v] ← 0, ∀v ∈ V 
25      while ¬empty(S) 
26      do 
27          w ← pop(S) 
28          for v ∈ P[w] 
29          do δ[v] ← δ[v] + (σ[v] / σ[w]) × (1 + δ[w]) 
30          if w ≠ s 
31          then C_B[w] ← C_B[w] + δ[w] 



The first step in computing the betweenness centrality updates induced by 
paths originating at vertex 1 is to perform a breadth-first search from that vertex, 
keeping track of the number of shortest paths to each vertex as the search progresses. 
It is easy to see that the number of shortest paths to a vertex v at depth d is simply 
the sum of the number of shortest paths to all vertices with edges going to v at 
depth d - 1. 

This progression is shown in Figure 6.10. The two interesting steps are seen in 
Figure 6.10(c) and Figure 6.10(e). Joins of multiple predecessors with shortest paths 
occur there. In Figure 6.10(c), three vertices, each with one shortest path, combine to 
form a shortest path count of three for their neighboring vertex. In Figure 6.10(e), 
two vertices, each with three shortest paths, combine to form a shortest path count 
of six for their neighboring vertex. 

Figure 6.9. Sample graph. 

Figure 6.10. Shortest path steps. 
Procedure for computing the shortest paths in the sample graph. 

Figure 6.11. Betweenness centrality updates. 
Steps for computing the betweenness centrality updates in the sample graph. 

Once the shortest paths have been computed, the centrality updates can 
begin. The updates are computed in reverse depth order. The update for a vertex 
is the sum, over all its shortest path successors, of the successor's update plus one 
for the successor itself, multiplied by the fraction of the successor's shortest paths 
that the edge contributed, which is simply the number of shortest paths to the vertex 
itself divided by the number of shortest paths to its successor. 

This cascading update is shown in Figure 6.11. The updates for the root and 
all end vertices are set to zero immediately. From this, the rest of the updates are 
computed. Figure 6.11(b) shows the first set of update computations. Because there 
are six paths to the final vertex, and only three paths to each of its predecessors, 
each predecessor receives only half of the final vertex's centrality update plus one, 
that is, (3/6) × (0 + 1) = 1/2. Figure 6.11(c) joins the updates from the two 
successors of vertex 5, each contributing (3/3) × (1/2 + 1) = 3/2, leading to a total 
centrality update of three for it. Figure 6.11(d) shows how its three predecessors 
split this score, leading to a centrality update of (1/3) × (3 + 1) = 4/3 for each of them. 

This completes the centrality updates induced by shortest paths originating 
from vertex 1. To verify this answer, the depth of each vertex from vertex 1, minus 
one, can be summed over all vertices other than the root. This sum should be 
equivalent to the sum of the centrality updates. As can be seen, vertices 2, 3, and 4 
each contribute 1 - 1 = 0 to the depth sum. Vertex 5 contributes 2 - 1 = 1 to the 
depth sum. Vertices 6 and 7 each contribute 3 - 1 = 2 to the depth sum. Finally, 
vertex 8 contributes 4 - 1 = 3 to the depth sum. This sum, 1 + 2 × 2 + 3 = 8, is 
the same as the sum of the centrality updates, 4/3 + 4/3 + 4/3 + 3 + 1/2 + 1/2 = 8. 
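The check described above can be reproduced with a short Python sketch. The edge list below is an assumption: the figure itself is not reproduced here, so the graph was chosen to match the depths and path counts quoted in the text (vertices 2, 3, 4 at depth 1, vertex 5 at depth 2, vertices 6 and 7 at depth 3, vertex 8 at depth 4). Exact fractions are used so the sums can be compared without rounding.

```python
from collections import deque
from fractions import Fraction

# Assumed edge list, chosen to reproduce the counts quoted in the text.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5),
         (5, 6), (5, 7), (6, 8), (7, 8)]


def single_source_updates(edges, root):
    """Depths, shortest-path counts, and centrality updates for one root."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    depth = {root: 0}
    sigma = {v: 0 for v in adj}
    sigma[root] = 1
    order = []
    queue = deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
            if depth[w] == depth[v] + 1:
                sigma[w] += sigma[v]
    delta = {v: Fraction(0) for v in adj}
    for w in reversed(order):
        for v in adj[w]:
            if depth[v] == depth[w] - 1:    # v is a shortest-path parent of w
                delta[v] += Fraction(sigma[v], sigma[w]) * (1 + delta[w])
    delta[root] = Fraction(0)               # the root receives no update
    return depth, sigma, delta


depth, sigma, delta = single_source_updates(EDGES, 1)
depth_sum = sum(d - 1 for v, d in depth.items() if v != 1)
update_sum = sum(delta.values())
```

The two sums agree because a vertex at depth d has d - 1 intermediate vertices on each of its shortest paths from the root, and the fractional credits for one target always add up to that count.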

Linear algebraic formulation 

In order to take advantage of breadth-first search using matrix vector multiplication, 
it is necessary to be able to perform updates to the parent and path information a 
full depth in the search at a time. In addition, to retain the advantage of a linear 
algebra representation, the second loop, the betweenness centrality updates, should 
also be able to be performed a full depth at a time. Fortunately, both of these are 
possible. 

In the first loop, the number of shortest paths is calculated in the natural way 
using a breadth-first search. In addition, rather than keeping track of shortest path 
parents, it suffices to remember the depth of each vertex in the breadth-first search. 
From this, the shortest path parents for a vertex v at breadth-first search depth d 
can be computed easily as all u ∈ V such that depth(u) = d - 1 and A(u, v) = 1. 
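As a small illustration of recovering parents from depths, the following NumPy sketch selects every u with depth(u) = depth(v) - 1 and A(u, v) = 1. The helper name shortest_path_parents and the four-vertex example graph are hypothetical and chosen only for this demonstration.

```python
import numpy as np


def shortest_path_parents(A, depth, v):
    """Return all u with depth[u] == depth[v] - 1 and A[u, v] == 1."""
    return np.flatnonzero((depth == depth[v] - 1) & (A[:, v] == 1))


# Hypothetical example: two routes from vertex 0 to vertex 2,
# one through vertex 1 and one through vertex 3.
A = np.zeros((4, 4), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 3), (3, 2)]:
    A[a, b] = A[b, a] = 1
depth = np.array([0, 1, 2, 1])      # BFS depths from vertex 0

parents_of_2 = shortest_path_parents(A, depth, 2)
parents_of_root = shortest_path_parents(A, depth, 0)
```

Because the parents can be recomputed this way, only the depth vector needs to be stored, not a predecessor list per vertex.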

During the second loop, since the only dependencies between the betweenness 
centrality updates are between parent and child, performing the updates a depth 
at a time causes no conflicts. The updates are done by selecting edges going to 
vertices at the current depth that are coming from vertices at the previous depth. 
These edges, which correspond to the betweenness centrality updates for their source 
vertices, are then weighted accordingly and summed up. 

Algorithm 6.5 shows Brandes' algorithm for computing betweenness centrality 
by using linear algebra. The variables used in the algorithm are shown in the 
following table. 

Name  Type      Description 
S     B^{N×N}   the search, keeps track of the depth at which each vertex is seen 
p     Z^N       the number of shortest paths to each vertex 
f     Z^N       the fringe, the number of shortest paths to vertices at the current depth 
w     R^N       the weights for the BC updates 
b     R^N       the BC score for each vertex 
u     R^N       the BC update for each vertex 
r     Z         the current root value, or starting vertex for which to compute BC updates 
d     Z         the current depth being examined 

Algorithm 6.5. Betweenness centrality matrix formulation. 

b = BetweennessCentrality(G = A : B^{N×N}) 
 1  b = 0 
 2  for 1 ≤ r ≤ N 
 3  do 
 4      d = 0 
 5      S = 0 
 6      p = 0, p(r) = 1 
 7      f = A(r, :) 
 8      while f ≠ 0 
 9      do 
10          d = d + 1 
11          p = p + f 
12          S(d, :) = f 
13          f = fA × ¬p 
14      while d ≥ 2 
15      do 
16          w = S(d, :) × (1 + u) ÷ p 
17          w = Aw 
18          w = w × S(d - 1, :) × p 
19          u = u + w 
20          d = d - 1 
21      b = b + u 



Line 8 performs the breadth-first search. After updating the search based on 
the new fringe obtained for the previous level, it selects the outgoing edges of that 
fringe, weighting them by the number of shortest paths that go to their parents and 
summing them up. It then filters out those that go to vertices that have been seen 
previously, resulting in the new fringe for the next level. The loop terminates when 
no new values are seen on the fringe. 

Line 14 performs the betweenness centrality updates. It processes the vertices 
in reverse depth order. First, it computes the weights corresponding to those derived 
from the children values (including filtering out edges that do not go to vertices at 
the current depth). It applies the computed weights to the adjacency matrix and 
sums up the row values. It then applies the weights corresponding to those derived 
from parent values (including filtering out vertices not at the previous depth) and 
adds in the new centrality updates. The loop updates centrality for all parents not 
including the root value. 

Algorithm 6.5 has a small optimization. Lines 16-19 can be combined into a 
single parenthesized linear algebraic operation. This reduction allows the w variable 
to be optimized out during implementation. 
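The matrix formulation can be sketched with dense NumPy arrays as shown below. Several choices here are assumptions made for the illustration and are not taken from the text: S is kept as a list of Boolean rows rather than an N × N matrix, u is explicitly reset to zero for each root, the graph is assumed to have no self-loops, and dense arrays are used for brevity where a real implementation would use sparse storage.

```python
import numpy as np


def betweenness_centrality_matrix(A):
    """Vertex betweenness centrality, one BFS depth per matrix operation."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    b = np.zeros(n)
    for r in range(n):
        S = []                          # S[d - 1]: vertices first seen at depth d
        p = np.zeros(n)
        p[r] = 1                        # one (empty) path to the root itself
        f = A[r, :].copy()              # fringe: path counts at current depth
        while f.any():
            p += f
            S.append(f != 0)
            f = (f @ A) * (p == 0)      # follow out-edges, drop seen vertices
        u = np.zeros(n)                 # assumed initialization of the updates
        d = len(S)
        while d >= 2:
            cur = S[d - 1]
            w = np.zeros(n)
            w[cur] = (1 + u[cur]) / p[cur]    # child weights at depth d
            w = A @ w                         # sum over out-edges of each vertex
            w = w * S[d - 2] * p              # keep parents at depth d - 1
            u += w
            d -= 1
        b += u
    return b


# Path graph 0 - 1 - 2 and a 4-cycle as small checks.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
b_path = betweenness_centrality_matrix(path)
b_cycle = betweenness_centrality_matrix(cycle)
```

The three statements inside the second loop correspond to lines 16 through 18 of the algorithm; folding them into one expression removes the temporary w, as noted above.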

Time complexity 

The breadth-first search performed in the loop on line 8 requires the standard 
O(N + M) operations because each vertex is seen at most once over the course of 
the search, which leads to each outgoing edge being selected at most once in the 
vector matrix multiplication. While the negation operation can take up to O(N) 
time for each depth in the search, in reality, the operation is filtering out existing 
edges in the fringe and can be performed through subtraction or exclusive or to 
remedy this issue. The fringe over all passes of the search will contain each vertex 
at most once. 

The centrality updates performed in the loop on line 14 start by computing 
the weight vector corresponding to the child weights. This is filtered by the vertices 
seen at the current depth, leading to at most O(N) operations being performed 
through every loop. This also applies to the parent weights computation and the 
summation in the end. Here, the weights are filtered by the vertices seen at the 
previous depth. The matrix vector multiplication only requires O(M) time through 
all loops because the vector has an entry for a vertex at most once through all loops. 
This leads to O(N + M) for the second loop as well. 

Since these two loops are performed O(N) times, the total time required by 
the algorithm is O(N^2 + NM). 

Space complexity 

The space required by each data structure in Algorithm 6.5 is listed below. 

Name  Space 
S     O(N) 
p     O(N) 
f     O(N) 
w     O(N) 
b     O(N) 
u     O(N) 
r     O(1) 
d     O(1) 

While the S matrix has N^2 entries, its storage is only O(N). This is because 
there are only N nonzero values in S, one per column, where the row of the entry 
corresponds to the depth at which the vertex for that column was seen. Given this, 
the overall additional space used by the algorithm is only O(N). 

6.2.3 Batch algorithm 

Algorithm 6.5 performs a single-source all-destinations breadth-first search. Rather 
than processing this for each root vertex in a loop, it can be modified to process 
all of the root vertices at once by using matrix matrix multiplications rather than 
matrix vector multiplications. This modification leads to performance similar to that 
of the original betweenness centrality algorithm, in that it requires space quadratic in 
N, O(N^2); however, it requires only O(NM) time. 

The Brandes algorithm and its modification are only two extremes of a 
parameterized algorithm for computing the betweenness centrality. This algorithm 
considers batches of vertices at a time rather than single vertices, as in the Brandes 
algorithm, or all vertices, as in the modified algorithm. 

Linear algebraic formulation 

Let P be a partitioning of the vertices V. For ease of exposition, it is assumed that 
P is selected in such a way that each partition p ∈ P has the same size, |p|. Where 
the number of partitions is |P|, the following relation holds: 

    |P| × |p| = N 

Algorithm 6.6 shows the vertex batching algorithm. Where |P| = N and |p| = 1, 
it reduces to the Brandes algorithm. Where |P| = 1 and |p| = N, it reduces to 
the modified matrix algorithm described above. Algorithm 6.6 performs the same 
general operations as Algorithm 6.5. However, rather than performing operations 
on a single set of vertices (a vector), it operates on multiple sets of vertices (a 
matrix) simultaneously. 
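A batched version can be sketched in NumPy by replacing each vector with a matrix that has one row per root in the current batch. As with any such sketch, the details are assumptions: the function name and batch_size parameter are hypothetical, batches are taken as consecutive vertex ranges (the last one may be smaller, unlike the equal-size partitions assumed in the text), and dense arrays stand in for sparse ones.

```python
import numpy as np


def betweenness_centrality_batch(A, batch_size):
    """Vertex betweenness centrality, processing batch_size roots at a time."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    b = np.zeros(n)
    for start in range(0, n, batch_size):
        roots = np.arange(start, min(start + batch_size, n))
        k = len(roots)
        P = np.zeros((k, n))
        P[np.arange(k), roots] = 1          # one row of the identity per root
        F = A[roots, :].copy()              # one fringe per root
        S = []                              # S[d - 1]: k x n depth-d indicator
        while F.any():
            P += F
            S.append(F != 0)
            F = (F @ A) * (P == 0)
        U = np.zeros((k, n))
        for d in range(len(S), 1, -1):
            safe_P = np.where(P == 0, 1.0, P)           # avoid division by zero
            W = np.where(S[d - 1], (1 + U) / safe_P, 0.0)
            W = (A @ W.T).T                             # sum over out-edges
            W = W * S[d - 2] * P                        # parents at depth d - 1
            U += W
        b += U.sum(axis=0)                  # reduce over the roots in the batch
    return b


cycle = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
results = [betweenness_centrality_batch(cycle, size) for size in (1, 2, 4)]
```

With batch_size equal to 1 this behaves like the single-root formulation, and with batch_size equal to N it processes every root at once, mirroring the two extremes described above.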

Time complexity 

Algorithm 6.6 has the same time complexity as Algorithm 6.5, O(N^2 + NM). It 
performs the exact same operations. However, it batches these operations in the 
form of a matrix matrix rather than matrix vector multiplication. While this does 
not improve the theoretical time, it can improve the actual runtime depending on 
the implementation. 

Space complexity 

The space required by each data structure in Algorithm 6.6 is listed below. 

Name  Space 
S     O(|p| × N) 
P     O(|p| × N) 
F     O(|p| × N) 
W     O(|p| × N) 
U     O(|p| × N) 
b     O(N) 
r     O(1) 
d     O(1) 


This leads to an overall space usage of O(|p| × N). While this is more than 
Algorithm 6.5 uses when the partition size is greater than one, if partitions are kept 
reasonably small, the trade may be worthwhile. The performance gains from batch 
processing may be substantial enough to warrant the small increase in storage. 






Algorithm 6.6. Betweenness centrality batch. 

Var   Type          Description 
S     B^{N×|p|×N}   the search, keeps track of the depth at which each vertex is seen for each starting vertex 
P     Z^{|p|×N}     the number of shortest paths to each vertex from each starting vertex 
F     Z^{|p|×N}     the fringe, the number of shortest paths to vertices at the current depth from each starting vertex 
W     R^{|p|×N}     the weights for the BC updates for each starting vertex 
B     R^{|p|×N}     the BC score for each vertex for each starting vertex 
U     R^{|p|×N}     the BC update for each vertex for each starting vertex 
r     Z^{|p|}       the current root values, or starting vertices for which to compute BC updates 
d     Z             the current depth being examined 

b = BetweennessCentrality(G = A : B^{N×N}, P) 
 1  b = 0 
 2  for r ∈ P 
 3  do 
 4      d = 0 
 5      S = 0 
 6      P = I(r, :) 
 7      F = A(r, :) 
 8      while F ≠ 0 
 9      do 
10          d = d + 1 
11          P = P + F 
12          S(d, :, :) = F 
13          F = FA × ¬P 
14      while d ≥ 2 
15      do 
16          W = S(d, :, :) × (1 + U) ÷ P 
17          W = (AW^T)^T 
18          W = W × S(d - 1, :, :) × P 
19          U = U + W 
20          d = d - 1 
21      b = b + (+. U) 

6.2.4 Algorithm for weighted graphs 

While it would be possible to implement the version of this algorithm for weighted 
graphs, which can typically run in O(N^2 log(N)) time, it would require O(N^3) time 
using matrix and vector operations. This increase is caused by the fact that the rows 
of the matrix cannot be stored and operated on as a priority queue. The traditional 
algorithm relies on a priority queue storage scheme to reduce the breadth-first search 
cost from N^3 to N^2 log(N) in the runtime. 

6.3 Edge betweenness centrality 

Centrality metrics for edges in general seek to generate information about the 
importance of edges in a graph. Betweenness centrality rates this importance based 
on the number of shortest paths an edge falls on between all pairs of vertices. It 
is commonly used in fault tolerance and key component analysis. The algorithm 
is similar to the vertex betweenness centrality algorithm described in Section 6.2. 
Like that algorithm, an approach with cascading updates is generally preferred. 

6.3.1 Brandes' algorithm 

Brandes' algorithm for edge betweenness centrality in unweighted graphs 
operates in much the same way as it did for the vertex formulation [Brandes 2001]. 
The first loop, which determines the number of shortest paths to each vertex, is in 
fact identical. The second loop, which performs the centrality updates, must 
update edges rather than vertices, however. 

Vertex/edge set formulation 

Algorithm 6.7 shows Brandes' algorithm for edge betweenness centrality using a 
vertex/edge set representation. The loop on line 10 performs the single-source shortest 
paths search (through a breadth-first search) from an individual vertex. This 
requires O(M) time. The loop on line 25 updates the centrality metrics. Through all 
the iterations, it performs at most one update for each edge in the graph, so this 
requires at most O(M) time. Over all iterations, the algorithm requires O(NM) 
time. 

Also, there are no data structures requiring more than O(M) storage, the 
storage required for the result. This leads to an overall storage requirement of 
O(M). 

Sample calculation 

The same example as the one used in vertex betweenness centrality is considered 
here, using the graph shown in Figure 6.9. A simple shortest paths calculation, as 
well as a centrality update for paths starting at vertex 1, is presented here. 

Algorithm 6.7. Edge betweenness centrality. 

EdgeBetweennessCentrality(G = (V, E)) 
 1  C_B[(v, w)] ← 0, ∀(v, w) ∈ E 
 2  for ∀s ∈ V 
 3  do 
 4      S ← empty stack 
 5      P[w] ← empty list, ∀w ∈ V 
 6      σ[t] ← 0, ∀t ∈ V, σ[s] ← 1 
 7      d[t] ← -1, ∀t ∈ V, d[s] ← 0 
 8      Q ← empty queue 
 9      enqueue(Q, s) 
10      while ¬empty(Q) 
11      do 
12          v ← dequeue(Q) 
13          push(S, v) 
14          for ∀w ∈ neighbors(v) 
15          do 
16              if d[w] < 0 
17              then 
18                  enqueue(Q, w) 
19                  d[w] ← d[v] + 1 
20              if d[w] = d[v] + 1 
21              then 
22                  σ[w] ← σ[w] + σ[v] 
23                  append(P[w], v) 
24      δ[v] ← 0, ∀v ∈ V 
25      while ¬empty(S) 
26      do 
27          w ← pop(S) 
28          for v ∈ P[w] 
29          do 
30              δ[v] ← δ[v] + σ[v] × (δ[w] / σ[w] + 1) 
31              C_B[(v, w)] ← C_B[(v, w)] + σ[v] × (δ[w] / σ[w] + 1) 



The first step in computing the betweenness centrality updates induced by 
paths originating at vertex 1 is to perform a breadth-first search from that vertex, 
keeping track of the number of shortest paths to each vertex as the search progresses. 
It is easy to see that the number of shortest paths to a vertex v at depth d is simply 
the sum of the number of shortest paths to all vertices with edges going to v at depth 
d - 1. This progression is identical to the one for vertex betweenness centrality and 
is shown in Figure 6.10. 

Once the shortest paths have been computed, the centrality updates can begin. 
The updates are computed in reverse depth order. As the updates proceed, each 
vertex also keeps track of the sum of all the updates from edges originating from 
it. The update for an edge is the number of shortest paths to the edge's originating 
vertex times the following quantity: the sum of the updates for all the edges 
originating from the edge's destination vertex, divided by the number of shortest 
paths to the destination vertex, plus one. 

Figure 6.12. Edge centrality updates. 
Steps for computing the edge centrality updates in the sample graph. 

This cascading update is shown in Figure 6.12. The update flow for all end 
vertices is set to zero immediately. From this, the rest of the updates and update 
flows are computed. Figure 6.12(b) shows the first set of update computations. 
Because there are six paths to the final vertex, and only three paths to each of 
its predecessors, each edge into the final vertex receives only half of those six paths. 
Each of these edges leaves a different predecessor, so the flow through each of those 
predecessors is just its edge's centrality update. In Figure 6.12(c), each of the two 
edges contributes three paths to its successor and three to the end vertex, making 
the update six for each edge. There is a single predecessor for those two edges, so 
that vertex's update flow is twelve. In Figure 6.12(d), the update of twelve is split 
between each of the three shortest path edges flowing in. A shortest path edge is 
then added in for the current vertex, giving each of the edges a score of five. They 
each have a single predecessor, so they each get a flow update of five as well. 
Finally, in Figure 6.12(e), each of the preceding edges receives one additional path 
for the subsequent vertex, giving each of those edges a score of six. The update flow 
through the root vertex is then eighteen. 

This completes the centrality updates induced by shortest paths originating 
from vertex 1. Unlike vertex betweenness centrality, the sum of the shortest paths 
is not related to the sum of the edge centrality scores. This is due to the fact that 
the score is not scaled based on the percent of shortest paths the edge lies on, only 
the number of shortest paths. 
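The edge scores quoted in this sample calculation can be reproduced with the following Python sketch of the update rule, score(v, w) = σ[v] × (flow[w] / σ[w] + 1). The edge list is an assumption chosen to match the path counts and depths described in the text, since the figure is not reproduced here, and the function name edge_updates_from_root is hypothetical.

```python
from collections import deque
from fractions import Fraction

# Assumed edge list, chosen to reproduce the counts quoted in the text.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5),
         (5, 6), (5, 7), (6, 8), (7, 8)]


def edge_updates_from_root(edges, root):
    """Edge centrality updates and vertex flows induced by one root."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    depth = {root: 0}
    sigma = {v: 0 for v in adj}
    sigma[root] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([root])
    while queue:                        # breadth-first search from the root
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in depth:
                depth[w] = depth[v] + 1
                queue.append(w)
            if depth[w] == depth[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    flow = {v: Fraction(0) for v in adj}
    score = {}
    for w in reversed(order):           # reverse depth order
        for v in preds[w]:
            update = sigma[v] * (flow[w] / sigma[w] + 1)
            score[(v, w)] = update
            flow[v] += update           # flow: sum of updates leaving v
    return score, flow


score, flow = edge_updates_from_root(EDGES, 1)
```

On this graph the sketch yields three for each edge into vertex 8, six for each edge leaving vertex 5, five for each edge into vertex 5, and six for each edge leaving the root, with a root flow of eighteen.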

Adjacency matrix formulation 

Similar to vertex betweenness centrality, edge betweenness centrality is processed 
a full depth in the search at a time both during the breadth-first search and the 
centrality update loops. The first loop involves a breadth-first search just as in the 
vertex formulation of the problem, Algorithm 6.5. The second loop is modified to 
compute edge rather than vertex centrality updates. 

Algorithm 6.8 shows Brandes' algorithm for computing edge betweenness cen- 
trality using linear algebra. 

Line 10 performs the breadth-first search. After updating the search based on 
the new fringe obtained for the previous level, it selects the outgoing edges of that 
fringe, weighting them by the number of shortest paths that go to their parents, and 
summing them up. It then filters out those that go to vertices that have been seen 
previously. This becomes the new fringe for the next level. The loop terminates 
when no new vertices are on the fringe. 

Line 16 performs the betweenness centrality updates. It processes the edges 
in reverse depth order. First, it computes the weights stemming from the children 
vertices (including filtering out edges that do not go to vertices at the current depth). 
It applies this to the columns of the adjacency matrix. It then applies the weights 
stemming from parent vertices (including filtering out edges that do not originate 
from the previous depth) to the rows of the matrix. Finally, it adds this update to 
the betweenness centrality scores and computes the vertex flow by summing over 
the rows of the current update. The loop updates centrality for all shortest path 
edges from the current root in the graph. 

Algorithm 6.8 has a small optimization. Lines 18-21 can be combined into a 
single parenthesized linear algebraic operation. This reduction allows the w variable 
to be optimized out during implementation. 
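The column and row scaling steps described above can be sketched with dense NumPy arrays. This sketch makes its own assumptions and should not be read as the printed algorithm: it stores an extra depth-0 row marking the root and runs the update loop down to depth 1 so that edges leaving the root are scored (as in the sample calculation), it keeps S as a list of Boolean rows, and it treats A(i, j) as the edge from i to j.

```python
import numpy as np


def edge_betweenness_matrix(A):
    """Edge betweenness scores as an N x N matrix, one depth at a time."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    B = np.zeros((n, n))
    for r in range(n):
        p = np.zeros(n)
        p[r] = 1
        root = np.zeros(n, dtype=bool)
        root[r] = True
        S = [root]                      # S[d]: vertices at depth d (S[0] = root)
        f = A[r, :].copy()
        while f.any():                  # breadth-first search
            p += f
            S.append(f != 0)
            f = (f @ A) * (p == 0)
        v = np.zeros(n)                 # flow through each vertex
        for d in range(len(S) - 1, 0, -1):
            cur = S[d]
            w = np.zeros(n)
            w[cur] = v[cur] / p[cur] + 1      # child weights at depth d
            U = A * w                         # scale columns by child weights
            w = S[d - 1] * p                  # parent weights at depth d - 1
            U = w[:, None] * U                # scale rows by parent weights
            B += U
            v = U.sum(axis=1)                 # flow into the next iteration
    return B


path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
B_path = edge_betweenness_matrix(path)
```

For the three-vertex path, each directed edge lies on two shortest paths (for example, the edge from 0 to 1 carries the paths from 0 to 1 and from 0 to 2), which is the value the sketch returns.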

Time complexity 

The breadth-first search performed in the loop on line 10 requires the standard 
O(N + M) operations because each vertex is seen at most once over the course of 
the search, which leads to each outgoing edge being selected at most once in the 
vector matrix multiplication. While the negation operation can take up to O(N) 
time for each depth in the search, in reality, the operation is filtering out existing 
edges in the fringe and can be performed through subtraction or exclusive or to 
remedy this issue. The fringe over all passes of the search will contain each vertex 
at most once. 

Algorithm 6.8. Edge betweenness centrality matrix formulation. 

Var   Type      Description 
S     B^{N×N}   the search, keeps track of the depth at which each vertex is seen 
p     Z^N       the number of shortest paths to each vertex 
f     Z^N       the fringe, the number of shortest paths to vertices at the current depth 
w               the column/row weights for the BC updates 
B               the BC score for each edge 
U               the BC update for each edge 
v     Z^N       the BC flow through the vertices 
r     Z         the current root value, or starting vertex for which to compute BC updates 
d     Z         the current depth being examined 

B = EdgeBetweennessCentrality(G = A : B^{N×N}) 
 1  B = 0 
 2  for 1 ≤ r ≤ N 
 3  do 
 4      d = 0 
 5      S = 0 
 6      p = 0, p(r) = 1 
 7      U = 0 
 8      v = 0 
 9      f = A(r, :) 
10      while f ≠ 0 
11      do 
12          d = d + 1 
13          p = p + f 
14          S(d, :) = f 
15          f = fA × ¬p 
16      while d ≥ 2 
17      do 
18          w = S(d, :) ÷ p × v + S(d, :) 
19          U = A .× w 
20          w = S(d - 1, :) × p 
21          U = w .× U 
22          B = B + U 
23          v = U +. 
24          d = d - 1 

The centrality updates performed in the loop on line 16 start by computing 
the weight vector corresponding to the child weights. This is filtered by the vertices 
seen at the current depth, leading to at most O(N) operations being performed 
through every loop. This also applies to the parent weights computation and the 
summation in the end. Here, the weights are filtered by the vertices seen at the 
previous depth. The matrix vector scaling only requires O(N + M) time through all 
loops because the vector has an entry for a vertex at most once through all loops. 
This leads to O(N + M) for the second loop as well. 

Since these two loops are performed O(N) times, the total time required by 
the algorithm is O(N^2 + NM). 

Space complexity 

The space required by each data structure in Algorithm 6.8 is listed below. 

Name  Space 
S     O(N) 
p     O(N) 
f     O(N) 
w     O(N) 
B     O(M) 
U     O(M) 
v     O(N) 
r     O(1) 
d     O(1) 

While the S matrix has N^2 entries, its storage is only O(N). This is because 
there are only N nonzero values in S, one per column, where the row of the entry 
corresponds to the depth at which the vertex for that column was seen. In addition, 
both B and U have at most one entry (centrality score) per edge. Given this, the 
overall additional space used by the algorithm is only O(M). 

6.3.2 Block algorithm 

Much like vertex betweenness centrality, edge betweenness centrality can also be 
performed using blocks of root vertices. Here, rather than matrix vector 
multiplication and matrix vector scaling, matrix matrix multiplication and matrix-tensor 
scaling are used. Since support for tensor operations is limited both in our linear 
algebra notation as well as in most mathematical software, an in-depth 
examination of this approach is not included. However, it performs similarly to the batched 
vertex version of the algorithm, requiring extra space O(|p| × M), where |p| is the 
size of a block. It still requires the same O(N^2 + NM) time, though the blocking 
may introduce constant speedups. 

6.3.3 Algorithm for weighted graphs 

While it would be possible to implement the version of this algorithm for weighted 
graphs, which can typically run in O(N^2 log(N)) time, it would require O(N^3) time 
using matrix and vector operations. This increase is caused by the fact that the rows 
of the matrix cannot be stored and operated on as a priority queue. The traditional 
algorithm relies on a priority queue storage scheme to reduce the breadth-first search 
cost from N^3 to N^2 log(N) in the runtime. 

Abstract 

Tensors are a useful tool for representing multi-link graphs, and tensor 
decompositions facilitate a type of link analysis that incorporates all 
link types simultaneously. An adjacency tensor is formed by stacking 
the adjacency matrix for each link type to form a three-way array. The 
CANDECOMP/PARAFAC (CP) tensor decomposition provides 
information about adjacency tensors of multi-link graphs analogous to that 
produced for adjacency matrices of single-link graphs using the singular 
value decomposition (SVD). The CP tensor decomposition generates 
feature vectors that incorporate all linkages simultaneously for each node in 
a multi-link graph. Feature vectors can be used to analyze bibliometric 
data in a variety of ways, for example, to analyze five years of 
publication data from journals published by the Society for Industrial and 
Applied Mathematics (SIAM). Experiments presented include analyzing 
a body of work, distinguishing between papers written by different 
authors with the same name, and predicting the journal in which a paper 
is published. 

7.1 Introduction 

Multi-link graphs, i.e., graphs with multiple link types, are challenging to analyze, 
yet such data are ubiquitous. For example, Adamic and Adar [Adamic & Adar 2005] 
analyzed a social network where nodes are connected by organizational structure, 
i.e., each employee is connected to his or her boss, and also by direct email 
communication. Social networks clearly have many types of links: familial, communication 
(phone, email, etc.), organizational, geographical, etc. 

Our overarching goals are to analyze data with multiple link types and to 
derive feature vectors for each individual node (or data object). As a motivating 
example, we use journal publication data, specifically considering several of the many 
ways that two papers may be linked. The analysis is applied to five years of journal 
publication data from eleven journals and a set of conference proceedings published 
by the Society for Industrial and Applied Mathematics (SIAM). The nodes represent 
published papers. Explicit, directed links exist whenever one paper cites another. 
Undirected similarity links are derived based on title, abstract, keyword, and 
authorship. Historically, bibliometric researchers have focused solely on citation analysis 
or text analysis, but not both simultaneously. Though this work focuses on the 
analysis of publication data, the techniques are applicable to a wide range of tasks, such 
as higher order web link graph analysis [Kolda & Bader 2006, Kolda et al. 2005]. 

Link analysis typically focuses on a single link type. For example, both 
PageRank [Brin & Page 1998] and HITS [Kleinberg 1999] consider the structure 
of the web and decompose the adjacency matrix of a graph representing the 
hyperlink structure. Instead of decomposing an adjacency matrix that represents a single 
link type, our approach is to decompose an adjacency tensor that represents multiple 
link types. 

A tensor is a multidimensional, or N-way, array. For multiple linkages, a 
three-way array can be used, where each two-dimensional frontal slice represents 
the adjacency matrix for a single link type. If there are N nodes and K link types, 
then the data can be represented as a three-way tensor of size N × N × K where 
the (i, j, k) entry is nonzero if node i is connected to node j by link type k. In 
the example of Adamic and Adar [Adamic & Adar 2005] discussed above, there are 
two link types: organizational connections versus email communication connections. 
For bibliometric data, the five different link types mentioned above correspond to 
(frontal) slices in the tensor; see Figure 7.1. 
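To make the stacking concrete, the following NumPy sketch builds a small N × N × K adjacency tensor from one edge list per link type. The function name adjacency_tensor, the toy edge lists, and the use of a dense array are all assumptions for illustration; data of realistic size would call for a sparse representation.

```python
import numpy as np


def adjacency_tensor(n, links):
    """Stack one n x n adjacency matrix per link type into an n x n x K array.

    links is a list of K edge lists; each edge is a pair (i, j) meaning
    node i is connected to node j by that link type (0-based indices).
    """
    X = np.zeros((n, n, len(links)))
    for k, edge_list in enumerate(links):
        for i, j in edge_list:
            X[i, j, k] = 1.0
    return X


# Hypothetical toy data: 3 nodes, 2 link types.
citation = [(0, 1), (2, 1)]                     # directed links
similarity = [(0, 2), (2, 0)]                   # undirected, stored both ways
X = adjacency_tensor(3, [citation, similarity])

citation_slice = X[:, :, 0]                     # frontal slice for link type 0
```

Each frontal slice X[:, :, k] is the ordinary adjacency matrix for link type k, so any single-link analysis can still be applied to one slice at a time.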

The CANDECOMP/PARAFAC (CP) tensor decomposition (see, for instance, 
[Carroll & Chang 1970, Harshman 1970]) is a higher order analog of the matrix 
singular value decomposition (SVD). The CP decomposition applied to the adjacency 
tensor of a multi-link graph leads to the following types of analysis. 

• The CP decomposition reveals "communities" within the data and how they 
are connected. For example, a particular factor may be connected primarily by title 
similarity while another may depend mostly on citations. 

• The CP decomposition also generates feature vectors for the nodes in the 
graph, which can be compared directly to get a similarity score that combines the 
multiple linkage types. 

Figure 7.1. Tensor slices. 

Slices of a third-order tensor representing a multi-link graph. The frontal 
slices X(:, :, k) are as follows. 

X(:, :, 1) = abstract similarity 
X(:, :, 2) = title similarity 
X(:, :, 3) = keyword similarity 
X(:, :, 4) = author similarity 
X(:, :, 5) = citation 

• The average of a set of feature vectors represents a body of work, e.g., by a 
given author, and can be used to find the most similar papers in the larger collection. 

• The feature vectors can be used for disambiguation. In this case, the feature 
vectors associated with the body of work for two or more authors indicate whether 
they are the same authors or not. For example, is H. Simon the same as H. S. 
Simon? 

• By inputting the feature vectors to a supervised learning method (decision 
trees and ensembles), the publication journal for each paper can be predicted. 

This chapter is organized as follows. A description of the CP tensor decompo- 
sition and how to compute it is provided in Section 7.2. We discuss the properties 
of the data and how they are represented as a sparse tensor in Section 7.3. Numer- 
ical results are provided in Section 7.4. Related work is discussed in Section 7.5. 
Conclusions and ideas for future work are discussed in Section 7.6. 

7.2 Tensors and the CANDECOMP/PARAFAC 
decomposition 

This section provides a brief introduction to tensors and the CP tensor
decomposition. For a survey of tensors and their decompositions, see [Kolda & Bader
2009].

7.2.1 Notation 

Scalars are denoted by lowercase letters, e.g., c. Vectors are denoted by boldface
lowercase letters, e.g., v. The ith entry of v is denoted by v(i). Matrices are
denoted by boldface capital letters, e.g., A. The jth column of A is denoted by
A(:,j) and element (i,j) by A(i,j). Tensors (i.e., N-way arrays) are denoted by
boldface Euler script letters, e.g., X. Element (i,j,k) of a third-order tensor X
is denoted by X(i,j,k). The kth frontal slice of a three-way tensor is denoted by
X(:,:,k); see Figure 7.1.

7.2.2 Vector and matrix preliminaries

The symbol $\otimes$ denotes the Kronecker product of vectors; for example

\[ \mathbf{x} = \mathbf{a} \otimes \mathbf{b} \quad\Leftrightarrow\quad \mathbf{x}(\ell) = \mathbf{a}(i)\,\mathbf{b}(j) \]

where $\ell = j + (i-1)J$ for all $1 \le i \le I$, $1 \le j \le J$.

This is a special case of the Kronecker product of matrices. 

The symbol .* denotes the Hadamard matrix product. This is the element-wise 
product of two matrices of the same size. 

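The symbol $\odot$ denotes the Khatri-Rao product (or columnwise Kronecker
product) of two matrices [Smilde et al. 2004]. For example, let $\mathbf{A} \in \mathbb{R}^{I \times K}$ and
$\mathbf{B} \in \mathbb{R}^{J \times K}$. Then

\[ \mathbf{A} \odot \mathbf{B} = [\, \mathbf{A}(:,1) \otimes \mathbf{B}(:,1) \;\; \mathbf{A}(:,2) \otimes \mathbf{B}(:,2) \;\; \cdots \;\; \mathbf{A}(:,K) \otimes \mathbf{B}(:,K) \,] \]

is a matrix of size $(IJ) \times K$.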
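
As an illustration of these two products, the following NumPy sketch builds the Khatri-Rao product and checks it column by column against the Kronecker product. The function name `khatri_rao` and the small example matrices are illustrative assumptions rather than part of the original text. Indices are 0-based in the code, so row `j + i*J` corresponds to the 1-based position j + (i - 1)J used above.

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product of A (I x K) and B (J x K); result is (I*J) x K.

    Hypothetical helper for illustration only.
    """
    I, K = A.shape
    J, K2 = B.shape
    if K != K2:
        raise ValueError("A and B must have the same number of columns")
    # Entry (i, j, k) of the broadcast product is A[i, k] * B[j, k];
    # the C-order reshape places it in row j + i*J (0-based).
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, K)

A = np.array([[1.0, 2.0], [3.0, 4.0]])               # I = 2, K = 2
B = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])   # J = 3, K = 2
P = khatri_rao(A, B)

# Each column of P is the Kronecker product of the matching columns.
for k in range(A.shape[1]):
    assert np.allclose(P[:, k], np.kron(A[:, k], B[:, k]))
```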

7.2.3 Tensor preliminaries 

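The norm of a tensor is given by the square root of the sum of the squares of all its
elements; i.e., for a tensor $\mathcal{X}$ of size $I \times J \times K$

\[ \|\mathcal{X}\|^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \mathcal{X}(i,j,k)^2 \]

This is the higher order analog of the Frobenius matrix norm.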

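The symbol $\circ$ denotes the outer product of vectors. For example, let $\mathbf{a} \in \mathbb{R}^{I}$,
$\mathbf{b} \in \mathbb{R}^{J}$, $\mathbf{c} \in \mathbb{R}^{K}$. Then

\[ \mathcal{X} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \quad\Leftrightarrow\quad \mathcal{X}(i,j,k) = \mathbf{a}(i)\,\mathbf{b}(j)\,\mathbf{c}(k) \]

for all $1 \le i \le I$, $1 \le j \le J$, $1 \le k \le K$.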

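A rank-one tensor is a tensor that can be written as the outer product of vectors. For
$\boldsymbol{\lambda} \in \mathbb{R}^{R}$, $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, and $\mathbf{C} \in \mathbb{R}^{K \times R}$, the Kruskal operator [Kolda 2006]
denotes a sum of rank-one tensors

\[ [\![ \boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \equiv \sum_{r=1}^{R} \boldsymbol{\lambda}(r)\, \mathbf{A}(:,r) \circ \mathbf{B}(:,r) \circ \mathbf{C}(:,r) \in \mathbb{R}^{I \times J \times K} \]

If $\boldsymbol{\lambda}$ is a vector of ones, then $[\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!]$ is used as shorthand.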
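
To make the Kruskal operator and the tensor norm concrete, here is a small dense NumPy sketch. The names `kruskal` and `tensor_norm` and the random test factors are assumptions made for illustration; a sparse implementation for large data would look different.

```python
import numpy as np

def kruskal(lam, A, B, C):
    """Dense I x J x K array equal to sum_r lam[r] * A[:, r] o B[:, r] o C[:, r].

    Hypothetical helper name; dense storage is only sensible for small sizes.
    """
    return np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)

def tensor_norm(X):
    """Square root of the sum of squares of all entries of X."""
    return np.sqrt(np.sum(X ** 2))

rng = np.random.default_rng(0)
lam = np.array([2.0, 0.5])                      # R = 2 weights
A = rng.random((4, 2))                          # I = 4
B = rng.random((3, 2))                          # J = 3
C = rng.random((5, 2))                          # K = 5
X = kruskal(lam, A, B, C)                       # shape (4, 3, 5)

# A single entry, written out as in the definition.
entry = sum(lam[r] * A[1, r] * B[2, r] * C[3, r] for r in range(2))
assert abs(X[1, 2, 3] - entry) < 1e-12
```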

Matricization, also known as unfolding or flattening, is the process of reordering
the elements of an N-way array into a matrix; in particular, the mode-n matricization
of a tensor $\mathcal{X}$ is denoted by $\mathbf{X}_{(n)}$; see, e.g., [Kolda 2006]. For a three-way
tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, the mode-n unfoldings are defined as follows

\[ \mathbf{X}_{(1)}(i,p) = \mathcal{X}(i,j,k) \quad \text{where } p = j + (k-1)J \tag{7.1} \]

\[ \mathbf{X}_{(2)}(j,p) = \mathcal{X}(i,j,k) \quad \text{where } p = i + (k-1)I \tag{7.2} \]

\[ \mathbf{X}_{(3)}(k,p) = \mathcal{X}(i,j,k) \quad \text{where } p = i + (j-1)I \tag{7.3} \]
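
The three unfoldings can be sketched in NumPy as follows. The function name `unfold` is an assumption for illustration, and the code uses 0-based indices, so the column positions are `j + k*J`, `i + k*I`, and `i + j*I` for modes 1, 2, and 3, respectively.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization (mode = 1, 2, or 3) of a three-way array.

    Hypothetical helper following the column orderings in (7.1)-(7.3),
    with 0-based indices.
    """
    I, J, K = X.shape
    if mode == 1:   # rows i, columns p = j + k*J
        return X.reshape(I, J * K, order="F")
    if mode == 2:   # rows j, columns p = i + k*I
        return X.transpose(1, 0, 2).reshape(J, I * K, order="F")
    if mode == 3:   # rows k, columns p = i + j*I
        return X.transpose(2, 0, 1).reshape(K, I * J, order="F")
    raise ValueError("mode must be 1, 2, or 3")

X = np.arange(24.0).reshape(2, 3, 4)   # I = 2, J = 3, K = 4
X1, X2, X3 = unfold(X, 1), unfold(X, 2), unfold(X, 3)
```

Column-major (`order="F"`) reshaping is used because the definitions let the earlier index vary fastest within each row.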

Figure 7.2. CP decomposition.

Approximates a tensor by a sum of rank-one factors.



7.2.4 The CP tensor decomposition 

The CP decomposition, first proposed by Hitchcock [Hitchcock 1927] and later
rediscovered simultaneously by Carroll and Chang [Carroll & Chang 1970] and
Harshman [Harshman 1970], is a higher order analog of the matrix SVD. It should
not be confused with the Tucker decomposition [Tucker 1966], a different higher
order analog of the SVD.

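CP decomposes a tensor into a sum of rank-one tensors. Let $\mathcal{X}$ be a tensor of
size $I \times J \times K$. A CP decomposition with R factors approximates the tensor $\mathcal{X}$ as

\[ \mathcal{X} \approx \sum_{r=1}^{R} \mathbf{A}(:,r) \circ \mathbf{B}(:,r) \circ \mathbf{C}(:,r) = [\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \]

where $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, and $\mathbf{C} \in \mathbb{R}^{K \times R}$. The matrices A, B, and C are
called the component matrices. Figure 7.2 illustrates the decomposition.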

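It is useful to normalize the columns of the matrices A, B, and C to length
one and rewrite the CP decomposition as

\[ \mathcal{X} \approx \sum_{r=1}^{R} \boldsymbol{\lambda}(r)\, \mathbf{A}(:,r) \circ \mathbf{B}(:,r) \circ \mathbf{C}(:,r) = [\![ \boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \]

where $\boldsymbol{\lambda} \in \mathbb{R}^{R}$. In contrast to the solution provided by the SVD, the factor matrices
A, B, and C do not have orthonormal columns [Kolda 2001, Kolda & Bader 2009].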

Each rank-one factor, λ(r) A(:, r) ∘ B(:, r) ∘ C(:, r), represents a "community"
within the data; see Section 7.4.1. The number of factors in the approximation, 
R, should loosely reflect the number of communities in the data. Often some 
experimentation is required to determine the most useful value of R. 



7.2.5 CP-ALS algorithm 

A common approach to fitting a CP decomposition is the ALS (alternating least
squares) algorithm [Carroll & Chang 1970, Harshman 1970]; see also [Tomasi 2006,
Faber et al. 2003, Tomasi & Bro 2006]. At each inner iteration, the CP-ALS
algorithm solves for one component matrix while holding the others fixed. For
example, it solves for the matrix C when A and B are fixed, i.e.,



\[ \min_{\mathbf{C}} \left\| \mathcal{X} - [\![ \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \right\| \tag{7.4} \]



In this case, λ is omitted because it will just be absorbed into the lengths of the
columns of C when the computation is complete. Equation (7.4) can be rewritten
as a matrix problem (see, e.g., [Smilde et al. 2004])



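\[ \min_{\mathbf{C}} \left\| \mathbf{X}_{(3)} - \mathbf{C} (\mathbf{B} \odot \mathbf{A})^{\mathsf{T}} \right\| \tag{7.5} \]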



Here $\mathbf{X}_{(3)}$ is the mode-3 matricization or unfolding from equation (7.3).

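Solving this problem makes use of the pseudoinverse of a Khatri-Rao product,
given by

\[ (\mathbf{B} \odot \mathbf{A})^{\dagger} = \left( (\mathbf{B}^{\mathsf{T}}\mathbf{B}) \mathbin{.*} (\mathbf{A}^{\mathsf{T}}\mathbf{A}) \right)^{\dagger} (\mathbf{B} \odot \mathbf{A})^{\mathsf{T}} \]

Note that only the pseudoinverse of an R × R matrix needs to be calculated rather
than that of an IJ × R matrix [Smilde et al. 2004].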

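The optimal C is the least squares solution to equation (7.5)

\[ \mathbf{C} = \mathbf{X}_{(3)} \left( (\mathbf{B} \odot \mathbf{A})^{\mathsf{T}} \right)^{\dagger} = \mathbf{X}_{(3)} (\mathbf{B} \odot \mathbf{A}) \left( (\mathbf{B}^{\mathsf{T}}\mathbf{B}) \mathbin{.*} (\mathbf{A}^{\mathsf{T}}\mathbf{A}) \right)^{\dagger} \]

which can be computed efficiently thanks to the properties of the Khatri-Rao product.
The other component matrices can be computed in an analogous fashion using
mode-1 and mode-2 matricizations of X in solving for A and B, respectively.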
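
The update for C can be checked numerically on a small random problem. The sketch below is an illustration under stated assumptions: the helper `khatri_rao`, the sizes, and the random data are hypothetical, and a generic least squares solver stands in as the reference solution of (7.5).

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product (hypothetical helper, as in the text's definition)."""
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 3, 2
A, B = rng.random((I, R)), rng.random((J, R))
X = rng.random((I, J, K))

# Mode-3 unfolding per (7.3): column p = i + j*I (0-based).
X3 = X.transpose(2, 0, 1).reshape(K, I * J, order="F")

Z = khatri_rao(B, A)                        # (I*J) x R, row index i + j*I
gram = (B.T @ B) * (A.T @ A)                # R x R Hadamard product, equals Z.T @ Z
C_fast = X3 @ Z @ np.linalg.pinv(gram)      # update from the text (R x R pseudoinverse)

# Reference: solve min_C ||X3 - C Z^T|| with a generic least squares solver.
C_ref = np.linalg.lstsq(Z, X3.T, rcond=None)[0].T

assert np.allclose(C_fast, C_ref)
```

Only the R × R matrix `gram` is (pseudo)inverted, which is the saving the text points out.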

It is generally efficient to initialize the ALS algorithm with the R leading
eigenvectors of $\mathbf{X}_{(n)} \mathbf{X}_{(n)}^{\mathsf{T}}$ for the nth component matrix as long as the nth dimension
of X is at least as big as R; see, e.g., [Kolda & Bader 2009]. Otherwise, random
initialization can be used. Only two of the three initial matrices need to be computed
since the other is solved for in the first step. The CP-ALS algorithm is presented
in Algorithm 7.1.

Algorithm 7.1. CP-ALS. 

CP decomposition via alternating least squares. X is a tensor of size I × J × K,
R > 0 is the desired number of factors in the decomposition, M > 0 is the maximum
number of iterations to perform, and ε > 0 is the stopping tolerance.

CP-ALS(X, R, M, ε)

m = 0
A = R principal eigenvectors of X_(1) X_(1)^T
B = R principal eigenvectors of X_(2) X_(2)^T
repeat
    m = m + 1
    C = X_(3) (B ⊙ A) ((B^T B) .* (A^T A))^†
    Normalize columns of C to length 1
    B = X_(2) (C ⊙ A) ((C^T C) .* (A^T A))^†
    Normalize columns of B to length 1
    A = X_(1) (C ⊙ B) ((C^T C) .* (B^T B))^†
    Store column norms of A in λ and normalize columns of A to length 1
until m > M or ‖X − ⟦λ; A, B, C⟧‖ < ε
return λ ∈ R^R, A ∈ R^(I×R), B ∈ R^(J×R), C ∈ R^(K×R)
    such that X ≈ ⟦λ; A, B, C⟧

In the discussion that follows, Λ denotes the R × R diagonal matrix whose
diagonal is λ.
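
For readers who want to experiment, the following is a minimal dense NumPy sketch of the steps in Algorithm 7.1. It is not the Tensor Toolbox implementation used for the experiments; the function names, the default arguments, and the dense storage are assumptions made for illustration, and the code is only suitable for small tensors whose first two dimensions are at least R.

```python
import numpy as np

def khatri_rao(A, B):
    """Columnwise Kronecker product (hypothetical helper)."""
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def unfold(X, mode):
    """Matricization with 0-based mode; column ordering as in (7.1)-(7.3)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1, order="F")

def leading_eigvecs(M, R):
    """R eigenvectors of symmetric M with the largest eigenvalues."""
    _, V = np.linalg.eigh(M)          # eigenvalues come back in ascending order
    return V[:, ::-1][:, :R]

def normalize_columns(M):
    """Return M with unit-length columns, plus the original column norms."""
    norms = np.linalg.norm(M, axis=0)
    safe = np.where(norms > 0, norms, 1.0)   # avoid dividing by zero
    return M / safe, norms

def cp_als(X, R, max_iter=100, tol=1e-8):
    """Sketch of CP-ALS for a dense three-way array X (assumed interface)."""
    X1, X2, X3 = unfold(X, 0), unfold(X, 1), unfold(X, 2)
    A = leading_eigvecs(X1 @ X1.T, R)
    B = leading_eigvecs(X2 @ X2.T, R)
    for _ in range(max_iter):
        C = X3 @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        C, _ = normalize_columns(C)
        B = X2 @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        B, _ = normalize_columns(B)
        A = X1 @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        A, lam = normalize_columns(A)
        model = np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)
        if np.linalg.norm(X - model) < tol:
            break
    return lam, A, B, C

# Usage on a small random tensor with R = 2 factors.
rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 5, 4))
lam, A, B, C = cp_als(Y, R=2, max_iter=50)
```

The residual is formed densely here for clarity; for large sparse data one would avoid building the full model tensor.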

All computations were performed using the Tensor Toolbox for Matlab (see 
[Bader & Kolda 2006, Bader & Kolda 2007]), which was appropriate because of its 
ability to handle large-scale, sparse tensors. 

7.3 Data 

The data consist of publication metadata from eleven SIAM journals as well as 
SIAM proceedings for the period 1999-2004. There are 5022 articles; the number 
of articles per publication is shown in Table 7.1. The names of the journals used 
throughout this paper are their ISI abbreviations and "SIAM PROC" is used to 
indicate the proceedings. 

7.3.1 Data as a tensor 

The data are represented as an N × N × K tensor where N = 5022 is the number of
documents and K = 5 is the number of link types. The five link types are described
below; see also Figure 7.1.

(1) The first slice (X(:,:,1)) represents abstract similarity; i.e., X(i,j,1) is
the cosine similarity of the abstracts for documents i and j. The Text to Matrix
Generator (TMG) v2.0 [Zeimpekis & Gallopoulos 2006] was used to generate a
term-document matrix, T. All words appearing on the default TMG stopword list
as well as words starting with a number were removed. The matrix was weighted
using term frequency and inverse document frequency local and global weightings

*http://www.isiknowledge.com/.

Table 7.1. SIAM publications.

Names of the SIAM publications along with the number of articles of
each used as data for the experiments.

Journal Name                Articles
SIAM J APPL DYN SYST              32
SIAM J APPL MATH                 548
SIAM J COMPUT                    540
SIAM J CONTROL OPTIM             577
SIAM J DISCRETE MATH             260
SIAM J MATH ANAL                 420
SIAM J MATRIX ANAL APPL          423
SIAM J NUMER ANAL                611
SIAM J OPTIM                     344
SIAM J SCI COMPUT                656
SIAM PROC                        469
SIAM REV                         142


(tf.idf); this means that

\[ \mathbf{T}(i,j) = f_{ij} \log_2 (N / N_i) \]

where $f_{ij}$ is the frequency of term i in document j and $N_i$ is the number of
documents that term i appears in. Each column of T is normalized to length one (for
cosine scores). Finally

\[ \mathcal{X}(:,:,1) = \mathbf{T}^{\mathsf{T}} \mathbf{T} \]

Because they are cosine scores, all entries are in the range [0, 1]. In order to sparsify the
slice, only scores greater than 0.2 (chosen heuristically to reduce the total number
of nonzeros in all three text similarity slices to approximately 250,000) are retained.
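
The construction of a text similarity slice can be sketched as follows. This is not the TMG pipeline used for the experiments: the function name `similarity_slice`, the toy count matrix, and the use of a base-2 logarithm are assumptions made for illustration, and stopword removal is omitted.

```python
import numpy as np

def similarity_slice(F, threshold=0.2):
    """Thresholded cosine similarity matrix from a terms x documents count matrix F.

    Hypothetical helper; log base 2 is assumed for the inverse document frequency.
    """
    n_docs = F.shape[1]
    doc_freq = np.count_nonzero(F, axis=1)            # N_i: documents containing term i
    idf = np.log2(n_docs / np.maximum(doc_freq, 1))   # guard against unused terms
    T = F * idf[:, None]                              # tf.idf weighting
    norms = np.linalg.norm(T, axis=0)
    T = T / np.where(norms > 0, norms, 1.0)           # unit-length document columns
    S = T.T @ T                                       # cosine scores
    S[S <= threshold] = 0.0                           # keep only scores above the threshold
    return S

# Toy example: 3 terms (rows) in 3 documents (columns).
F = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 1.0]])
S = similarity_slice(F)
```

In this toy matrix the third term occurs in every document, so its weight is zero and it does not contribute to any score.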

(2) The second slice (X(:,:,2)) represents title similarity; i.e., X(i,j,2) is the
cosine similarity of the titles for documents i and j. It is computed in the same
manner as the abstract similarity slice.

(3) The third slice (X(:,:,3)) represents author-supplied keyword similarity;
i.e., X(i,j,3) is the cosine similarity of the keywords for documents i and j. It is
computed in the same manner as the abstract similarity slice.

(4) The fourth slice (X(:,:,4)) represents author similarity; i.e., X(i,j,4) is
the similarity of the authors for documents i and j. It is computed as follows. Let
W be the author-document matrix such that

\[ \mathbf{W}(i,j) = \begin{cases} 1/\sqrt{M_j} & \text{if author } i \text{ wrote document } j, \\ 0 & \text{otherwise} \end{cases} \]

where $M_j$ is the number of authors for document j. Then

\[ \mathcal{X}(:,:,4) = \mathbf{W}^{\mathsf{T}} \mathbf{W} \]

(5) The fifth slice (X(:,:,5)) represents citation information; i.e.,

\[ \mathcal{X}(i,j,5) = \begin{cases} 2 & \text{if document } i \text{ cites document } j, \\ 0 & \text{otherwise} \end{cases} \]

For this document collection, a weight of 2 was chosen heuristically so that the
overall slice weight (i.e., the sum of all the entries in X(:, :, k), see Table 7.3) would 
not be too small relative to the other slices. The interpretation is that there are 
relatively few connections in this slice, but each citation connection indicates a 
strong connection. In future work, we would like to consider less ad hoc ways of 
determining the value for citation links. 
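
The author and citation slices described in items (4) and (5) can be sketched on toy data as follows. The authorship and citation pairs below are invented for illustration, and the variable names are assumptions; every document is assumed to have at least one author so that the normalization is well defined.

```python
import numpy as np

# Hypothetical toy data: (author, document) and (citing, cited) index pairs.
authorship = [(0, 0), (1, 0), (1, 1), (2, 2)]
citations = [(1, 0), (2, 0)]
n_authors, n_docs = 3, 3

# Author-document matrix W with entries 1/sqrt(M_j).
W = np.zeros((n_authors, n_docs))
for a, d in authorship:
    W[a, d] = 1.0
W /= np.sqrt(W.sum(axis=0))          # column sums are M_j, the authors per document
author_slice = W.T @ W               # symmetric; diagonal entries equal 1

# Citation slice: weight 2 where document i cites document j (asymmetric).
citation_slice = np.zeros((n_docs, n_docs))
for i, j in citations:
    citation_slice[i, j] = 2.0
```

Here documents 0 and 1 share one author, so their author similarity is 1/sqrt(2), while document 2 shares no authors with the others.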

Each slice is an adjacency matrix of a particular graph. The first four slices are 
symmetric and correspond to undirected graphs; the fifth slice is asymmetric and 
corresponds to a directed graph. These graphs can be combined into a multi-link 
graph and a corresponding tensor representation since they are all on the same set 
of nodes. 

These choices for link types are examples of what can be done — many other 
choices are possible. For instance, asymmetric similarity weights are an option; e.g., 
if document i is a subset of document j, the measure might say that document i is
very similar to document j, but document j is not so similar to document i. Other 
symmetric measures include co-citation or co-publication in the same journal. 



7.3.2 Quantitative measurements on the data 

Table 7.2 shows overall statistics on the data set. Note that some of the documents 
in this data set have empty titles, abstracts, or keywords; the averages shown in 
the table are not adjusted for the lack of data for those documents. Recall that 
Table 7.1 shows the number of articles per journal. In Table 7.2, the citations 
are counted only when both articles are in the data set and reflect the number of 
citations from each article. The maximum citations to a single article is 15. 

Table 7.3 shows the number of nonzero entries and the sums of the entries for 
each slice. The text similarity slices (k = 1, 2, 3) have large numbers of nonzeros
but low average values, the author similarity slice has few nonzeros but a higher 
average value, and the citation slice has the fewest nonzeros but all values are equal 
to 2. 



7.4 Numerical results 

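The results use a CP decomposition of the data tensor $\mathcal{X} \in \mathbb{R}^{N \times N \times K}$

\[ \mathcal{X} \approx [\![ \boldsymbol{\lambda}; \mathbf{A}, \mathbf{B}, \mathbf{C} ]\!] \]

where $\boldsymbol{\lambda} \in \mathbb{R}^{R}$, $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{N \times R}$, and $\mathbf{C} \in \mathbb{R}^{K \times R}$. Using R = 30 factors worked well
for the experiments and is the default value unless otherwise noted.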

Table 7.2. SIAM journal characteristics.

Characteristics of the SIAM journal and proceedings data (5022 documents
in total).

                                Total in      Per Document
                                Collection    Average    Maximum
Unique terms                      16617        148.32      831
  abstracts                       15752        128.06      802
  titles                           5164         10.16       33
  keywords                         5248         10.10       40
Authors                            6891          2.19       13
Citations (within collection)      2659          0.53       12



Table 7.3. SIAM journal tensors.

Characteristics of the tensor representation of the SIAM journal and
proceedings data.

Slice (k)  Description          Nonzeros   Sum of entries
1          Abstract Similarity     28476          7695.28
2          Title Similarity       120236         33285.79
3          Keyword Similarity     115412         16201.85
4          Author Similarity       16460          8027.46
5          Citation                 2659          5318.00


7.4.1 Community identification 

The rank-one CP factors (see Figure 7.2) reveal communities within the data. The 
largest entries for the vectors in each factor 

(A(:,r),B(:,r),C(:,r)) 

correspond to interlinked entries in the data. For the rth factor, high-scoring nodes 
in A(:,r) are connected to high-scoring nodes in B(:,r) with the high-scoring link 
types in C(:, r). Recall that the fifth link type, representing citations, is asymmetric; 
when that link type scores high in C(:, r), then the highest-scoring nodes in A(:, r) 
can be thought of as papers that cite the highest-scoring nodes in B(:, r). 
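
Reading a community off a factor amounts to sorting the entries of the three vectors. The sketch below shows one way to do this; the function name `top_entries`, the toy vectors, and the paper labels are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def top_entries(v, labels, n=3):
    """Return the n (label, score) pairs with the largest scores in v.

    Hypothetical helper for inspecting one column of A, B, or C.
    """
    order = np.argsort(v)[::-1][:n]          # indices sorted by descending score
    return [(labels[i], float(v[i])) for i in order]

# Toy factor r: node scores in A(:, r) and B(:, r), link scores in C(:, r).
papers = ["paper 0", "paper 1", "paper 2", "paper 3"]
links = ["abstract", "title", "keyword", "author", "citation"]
a_r = np.array([0.1, 0.7, 0.3, 0.6])
b_r = np.array([0.8, 0.2, 0.5, 0.1])
c_r = np.array([0.05, 0.15, 0.10, 0.20, 0.95])

citing = top_entries(a_r, papers, n=2)       # high scorers in A(:, r)
cited = top_entries(b_r, papers, n=2)        # high scorers in B(:, r)
dominant = top_entries(c_r, links, n=1)      # strongest link type in C(:, r)
```

In this toy factor the citation link dominates, so the high scorers in `a_r` would be read as papers citing the high scorers in `b_r`.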

For example, consider the first factor (r = 1). The link scores from C(:, 1) 
are shown in Table 7.4. Title and keyword similarities are strongest. In fact, the 
top three link types are based on text similarity and so are symmetric. Therefore, 
it is no surprise that the highest-scoring nodes in A(:, 1) and B(:, 1), also shown 
in Table 7.4, are nearly identical. This community is related primarily by text 
similarity and is about the topic "conservation laws." 

On the other hand, the tenth factor (r = 10) has citation as the dominant link
type; see Table 7.5. Citation links are asymmetric, so the highest-scoring nodes in 
A(:, 10) and B(:, 10) are not the same. This is a community that is linked primarily 

Table 7.4. First community in CP decomposition.

Community corresponding to the first factor (r = 1) of the CP tensor
decomposition with R = 30 factors.

Link scores in C(:, 1)

Score  Link Type
0.95   Title Similarity
0.28   Keyword Similarity
0.07   Abstract Similarity
0.06   Citation
0.06   Author Similarity

Paper node scores in A(:, 1) (top 10)

Score  Title
0.18   On the boundary control of systems of conservation laws
0.17   On stability of conservation laws
0.16   Two a posteriori error estimates for 1D scalar conservation laws
0.16   A free boundary problem for scalar conservation laws
0.15   Convergence of SPH method for scalar nonlinear conservation laws
0.15   Adaptive discontinuous Galerkin finite element methods for nonlinear hyperbolic . . .
0.15   High-order central schemes for hyperbolic systems of conservation laws
0.15   Adaptive mesh methods for one- and two-dimensional hyperbolic conservation laws

Paper node scores in B(:, 1) (top 10)

Score  Title
0.18   On the boundary control of systems of conservation laws
0.18   On stability of conservation laws
0.16   Two a posteriori error estimates for one-dimensional scalar conservation laws
0.16   A free boundary problem for scalar conservation laws
0.16   Adaptive discontinuous Galerkin finite element methods for nonlinear hyperbolic . . .
0.16   Convergence of SPH method for scalar nonlinear conservation laws
0.15   Adaptive mesh methods for one- and two-dimensional hyperbolic conservation laws
0.14   High-order central schemes for hyperbolic systems of conservation laws


because the high-scoring papers in A(:, 10) cite the high-scoring papers in B(:, 10). 
The topic of this community is "preconditioning," though the third paper in B(:, 10) 
is not about preconditioning directly but rather a graph technique that can be used 
by preconditioners — that is why it is on the "cited" side. 

The choice to have symmetric or asymmetric connections affects the inter- 
pretation of the CP model. In this case, the tensor has four symmetric slices 
and one asymmetric slice. If all of the slices were symmetric, then this would 
be a special case of the CP decomposition called the INDSCAL decomposition 
[Carroll & Chang 1970] where A = B. In related work, Selee et al. [Selee et al.
2007] have investigated this situation.

7.4.2 Latent document similarity 

The CP component matrices A and B provide latent representations (i.e., feature 
vectors) for each document node. These feature vectors can, in turn, be used to 

Table 7.5. Tenth community in CP decomposition.

Community corresponding to the tenth factor (r = 10) of the CP tensor
decomposition with R = 30 factors.

Link scores in C(:, 10)

Score  Link Type
0.96   Citation
0.19   AuthorSim
0.16   TitleSim
0.10   KeywordSim
0.06   AbstractSim

Paper node scores in A(:, 10) (top 10)

Score  Title
0.36   Multiresolution approximate inverse preconditioners
0.20   Preconditioning highly indefinite and nonsymmetric matrices
0.16   A factored approximate inverse preconditioner with pivoting
0.16   On two variants of an algebraic wavelet preconditioner
0.14   A robust and efficient ILU that incorporates the growth of the inverse triangular factors
0.11   An algebraic multilevel multigraph algorithm
0.11   On algorithms for permuting large entries to the diagonal of a sparse matrix
0.11   Preconditioning sparse nonsymmetric linear systems with the Sherman-Morrison formula

Paper node scores in B(:, 10) (top 10)

Score  Title
0.27   Ordering anisotropy and factored sparse approximate inverses
0.25   Robust approximate inverse preconditioning for the conjugate gradient method
0.23   A fast and high-quality multilevel scheme for partitioning irregular graphs
0.20   Orderings for factorized sparse approximate inverse preconditioners
0.19   The design and use of algorithms for permuting large entries to the diagonal of . . .
0.17   BILUM: Block versions of multielimination and multilevel ILU preconditioner . . .
0.16   Orderings for incomplete factorization preconditioning of nonsymmetric problems
0.15   Preconditioning highly indefinite and nonsymmetric matrices

compute document similarity scores inclusive of text, authorship, and citations.
Since there are two applicable component matrices, A, B, or some combination can
be used. For example

\[ \mathbf{S} = \tfrac{1}{2} \mathbf{A}\mathbf{A}^{\mathsf{T}} + \tfrac{1}{2} \mathbf{B}\mathbf{B}^{\mathsf{T}} \tag{7.6} \]

Here S is an N × N similarity matrix where the similarity for documents i and j is
given by S(i,j).

It may also be desirable to incorporate Λ, e.g.,

\[ \mathbf{S} = \tfrac{1}{2} \mathbf{A}\boldsymbol{\Lambda}\mathbf{A}^{\mathsf{T}} + \tfrac{1}{2} \mathbf{B}\boldsymbol{\Lambda}\mathbf{B}^{\mathsf{T}} \]

This issue is reminiscent of the choice facing users of latent semantic indexing (LSI)
[Dumais et al. 1988], which uses the SVD of a term-document matrix, producing
term and document matrices. In LSI, there is a choice of how to use the diagonal
scaling for the queries and comparisons [Berry et al. 1995].
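
Both variants of the similarity matrix can be sketched in a few lines of NumPy. The function names `similarity` and `most_similar`, the toy factor matrices, and the choice to exclude the query document from its own ranking are assumptions for illustration, not details taken from the experiments.

```python
import numpy as np

def similarity(A, B, lam=None):
    """S = 0.5 * A diag(lam) A^T + 0.5 * B diag(lam) B^T; lam defaults to all ones.

    Hypothetical helper covering equation (7.6) and its weighted variant.
    """
    if lam is None:
        lam = np.ones(A.shape[1])
    # Multiplying by lam scales the columns, i.e., A * lam equals A @ diag(lam).
    return 0.5 * (A * lam) @ A.T + 0.5 * (B * lam) @ B.T

def most_similar(S, i, n=10):
    """Indices of the n documents most similar to document i, excluding i itself."""
    scores = S[i].copy()
    scores[i] = -np.inf                      # assumption: drop the query document
    return np.argsort(scores)[::-1][:n]

# Toy component matrices with N = 4 documents and R = 2 factors.
A = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]])
B = np.array([[0.7, 0.2], [0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
S = similarity(A, B)
neighbors = most_similar(S, 0, n=2)
```

Forming the full N × N matrix is only practical for small N; for a single query it suffices to compute one row of S.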

Table 7.6. Articles similar to Link Analysis . . . .

Comparison of most similar articles to Link Analysis: Hubs and Authorities
on the World Wide Web using different numbers of factors in the
CP decomposition.

R = 10

Score     Title
0.000079  Ordering anisotropy and factored sparse approximate inverses
0.000079  Robust approximate inverse preconditioning for the conjugate gradient method
0.000077  An interior point algorithm for large-scale nonlinear programming
0.000073  Primal-dual interior-point methods for semidefinite programming in finite precision
0.000068  Some new search directions for primal-dual interior point methods in semidefinite . . .
0.000068  A fast and high-quality multilevel scheme for partitioning irregular graphs
0.000067  Reoptimization with the primal-dual interior point method
0.000065  Superlinear convergence of primal-dual interior point algorithms for nonlinear . . .
0.000064  A robust primal-dual interior-point algorithm for nonlinear programs
0.000063  Orderings for factorized sparse approximate inverse preconditioners

R = 30

Score     Title
0.000563  Skip graphs
0.000356  Random lifts of graphs
0.000354  A fast and high-quality multilevel scheme for partitioning irregular graphs
0.000322  The minimum all-ones problem for trees
0.000306  Rankings of directed graphs
0.000295  Squarish k-d trees
0.000284  Finding the k-shortest paths
0.000276  On floor-plan of plane graphs
0.000275  1-Hyperbolic graphs
0.000269  Median graphs and triangle-free graphs



As an example of how these similarity measures can be used, consider the
paper Link analysis: Hubs and authorities on the World Wide Web by Ding et al.,
which presents an analysis of an algorithm for web graph link analysis. Table 7.6
shows the most similar articles to this paper based on equation (7.6) for two different
CP decompositions with R = 10 and R = 30 factors. In the R = 10 case, the results
are not very good because the "most similar" papers include several papers on
interior point methods that are not related. The results for R = 30 are all focused
on graphs and are therefore related. Observe that there is also a big difference in the
magnitude of the similarity scores in the two different cases. This example illustrates
that, just as with LSI, choosing the number of factors of the approximation (R) is
heuristic and affects the similarity scores.

In the next section, feature vectors from the CP factors are combined to 
represent a body of work. 

7.4.3 Analyzing a body of work via centroids 

Finding documents similar to a body of work may be useful in a literature search 
or in finding other authors working in a given area. This subsection and the next 
discuss two sets of experiments using centroids, corresponding to a term or an 
author, respectively, to analyze a body of work. 

Consider finding collections of articles containing a particular term (or phrase).
All articles containing the term in either the title, abstract, or keywords are
identified, and then the centroids $\mathbf{g}_A$ and $\mathbf{g}_B$ are computed using the rows of the
matrices A and B, respectively, for the identified articles. The similarity scores for
all documents to the body of work are then computed as

\[ \mathbf{s} = \tfrac{1}{2} \mathbf{A}\mathbf{g}_A + \tfrac{1}{2} \mathbf{B}\mathbf{g}_B \tag{7.7} \]

Consequently, s(i) is the similarity of the ith document to the centroid.
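
The centroid computation can be sketched as follows, treating each document's feature vector as a row of A or B (both are N × R). The function name `centroid_scores` and the toy matrices are assumptions for illustration only.

```python
import numpy as np

def centroid_scores(A, B, doc_ids):
    """Similarity of every document to the centroid of the documents in doc_ids.

    Hypothetical helper following s = 0.5 * A g_A + 0.5 * B g_B.
    """
    g_A = A[doc_ids].mean(axis=0)    # centroid of the selected rows of A
    g_B = B[doc_ids].mean(axis=0)    # centroid of the selected rows of B
    return 0.5 * A @ g_A + 0.5 * B @ g_B

# Toy component matrices with N = 3 documents and R = 2 factors.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Suppose documents 0 and 2 contain the search term.
s = centroid_scores(A, B, [0, 2])
ranking = np.argsort(s)[::-1]        # documents ordered by descending similarity
```

For a term search, `doc_ids` would hold the indices of the articles whose title, abstract, or keywords contain the term.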

Table 7.7 shows the results of a search on the term "GMRES," which is an 
iterative method for solving linear systems. The table lists the top-scoring docu- 
ments using a combination of matrices A and B. In order not to overemphasize the 
papers that cite many of the papers about GMRES (i.e., using only the components 
from A) or those which are most cited (i.e., using only the components from B), 
combining the two sets of scores takes into account the content of the papers (i.e., 
abstracts, titles, and keywords) as an average of these two extremes. Thus, the 
average scores result in a more balanced look at papers about GMRES. 

Similarly, centroids were used to analyze a body of work associated with a 
particular author. All of the articles written by an author were used to generate 
a centroid and similarity score vector as above. Table 7.8 shows the most similar 
papers to the articles written by V. Kumar, a researcher who focuses on several 
research areas, including graph analysis. Of the ten articles in the table, only
three papers (including the two authored by V. Kumar) are explicitly linked to V.
Kumar by coauthorship or citations. Furthermore, several papers that are closely
related to those written by V. Kumar focus on graph analysis, while some are
not so obviously linked. Table 7.8 lists the authors as well to illustrate that such
results could be used as a starting point for finding authors related to V. Kumar
that are not necessarily linked by coauthorship or citation. In this case, the author
W. P. Tang appears to be linked to V. Kumar.

Analysis of centroids derived from tensor decompositions can be useful in 
understanding small collections of documents. For example, such analysis could be 
useful for matching referees to papers. In this case, program committee chairs could 
create a centroid for each member on a program committee, and work assignments 
could be expedited by automatically matching articles to the appropriate experts. 

As a segue to the next section, note that finding a set of documents associated 
with a particular author is not always straightforward. In fact, in the example 
above, there is also an author named V. S. A. Kumar, and it is not clear from 
article titles alone that this author is not the same one as V. Kumar. The next
section discusses the use of the feature vectors produced by tensor decompositions 
for solving this problem of author disambiguation. 

7.4.4 Author disambiguation 

A challenging problem in working with publication data is determining whether two 
authors are in fact a single author using multiple aliases. Such problems are often 
caused by incomplete or incorrect data or varying naming conventions for authors 

Table 7.7. Articles similar to GMRES.

Articles similar to the centroid of articles containing the term GMRES
using the component matrices of a CP tensor decomposition to compute
similarity scores.

Highest-scoring nodes using (1/2) A g_A + (1/2) B g_B

Score   Title
0.0134  FQMR: A flexible quasi-minimal residual method with inexact . . .
0.0130  Flexible inner-outer Krylov subspace methods
0.0114  Adaptively preconditioned GMRES algorithms
0.0112  Truncation strategies for optimal Krylov subspace methods
0.0093  Theory of inexact Krylov subspace methods and applications to . . .
0.0086  Inexact preconditioned conjugate gradient method with inner-outer iteration
0.0085  Flexible conjugate gradients
0.0078  GMRES with deflated restarting
0.0065  A case for a biorthogonal Jacobi-Davidson method: Restarting and . . .
0.0062  On the convergence of restarted Krylov subspace methods

Highest-scoring nodes using A g_A

Score (A g_A)  Score (B g_B)  Title
0.0240         0.0019         Flexible inner-outer Krylov subspace methods
0.0185         0.0082         FQMR: A flexible quasi-minimal residual method with inexact . . .
0.0169         0.0017         Theory of inexact Krylov subspace methods and applications to . . .
0.0132         0.0024         GMRES with deflated restarting
0.0127         0.0003         A case for a biorthogonal Jacobi-Davidson method: Restarting and . . .
0.0107         0.0010         A class of spectral two-level preconditioners
0.0076         0.0011         An augmented conjugate gradient method for solving consecutive . . .

Highest-scoring nodes using B g_B

Score (B g_B)  Score (A g_A)  Title
0.0217         0.0011         Adaptively preconditioned GMRES algorithms
0.0158         0.0014         Inexact preconditioned conjugate gradient method with inner-outer iteration
0.0149         0.0074         Truncation strategies for optimal Krylov subspace methods
0.0113         0.0056         Flexible conjugate gradients
0.0082         0.0185         FQMR: A flexible quasi-minimal residual method with inexact . . .
0.0080         0.0007         Linear algebra methods in a mixed approximation of magnetostatic problems
0.0063         0.0060         On the convergence of restarted Krylov subspace methods



used by different publications (e.g., J. R. Smith versus J. Smith). In the SIAM
articles, there are many instances where two or more authors share the same last
name and at least the same first initial, e.g., V. Torczon and V. J. Torczon. In
these cases, the goal is to determine which names refer to the same person.

The procedure for solving this author disambiguation problem works as follows.
For each author name of interest, we extract all the rows from the matrix
B corresponding to the articles written by that author name. Recall that the matrix
B comes from the R = 30 CP decomposition. Because of the directional citation
links in X(:,:,5), using the matrix B slightly favors author names that are co-cited
(i.e., their papers are cited together in papers), whereas using A would have slightly

Table 7.8. Similarity to V. Kumar.

Papers similar to those by V. Kumar using a rank R = 30 CP tensor
decomposition.

Score   Authors                     Title
0.0645  Karypis G, Kumar V          A fast and high-quality multilevel scheme for partitioning . . .
0.0192  Bank RE, Smith RK           The incomplete factorization multigraph algorithm
0.0149  Tang WP, Wan WL             Sparse approximate inverse smoother for multigrid
0.0115  Chan TF, Smith B, Wan WL    An energy-minimizing interpolation for robust methods . . .
0.0114  Henson VE, Vassilevski PS   Element-free AMGe: General algorithms
0.0108  Hendrickson B, Rothberg E   Improving the run-time and quality of nested dissection . . .
0.0092  Karypis G, Kumar V          Parallel multilevel k-way partitioning scheme for irregular . . .
0.0091  Tang WP                     Toward an effective sparse approximate inverse preconditioner
0.0085  Saad Y, Zhang J             BILUM: Block versions of multielimination and multilevel . . .
0.0080  Bridson R, Tang WP          A structural diagnosis of some IC orderings

favored author names that co-cite (i.e., their papers cite the same papers). The
centroid of those rows from B is used to represent the author name. Two author
names are compared by computing the cosine similarity of their two centroids,
resulting in a value between −1 (least similar) and 1 (most similar). In the example
above, the similarity score of the centroids for V. Torczon and V. J. Torczon
is 0.98, and thus there is a high confidence that these names both refer to the same
person (verified by manual inspection of the articles).
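
The comparison of two author names can be sketched as below. The function names, the toy matrix B, and the document index lists are hypothetical; in practice the index lists would come from the author metadata for each name.

```python
import numpy as np

def author_centroid(B, doc_ids):
    """Mean feature vector of the documents written under one author name."""
    return B[doc_ids].mean(axis=0)

def cosine(u, v):
    """Cosine similarity of two nonzero vectors, a value in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy component matrix with N = 4 documents and R = 2 factors.
B = np.array([[1.0, 0.0],
              [1.0, 0.2],
              [0.9, 0.1],
              [0.0, 1.0]])

# Hypothetical document lists for two spellings of an author name.
name_1_docs = [0, 1]
name_2_docs = [2, 3]

score = cosine(author_centroid(B, name_1_docs),
               author_centroid(B, name_2_docs))
```

A score close to 1 would suggest the two names refer to the same person, although, as the experiments below indicate, no single cutoff separates all correct and incorrect matches.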

As an example use of author disambiguation, the following experiment was
performed. (i) The top 40 author names of papers in the data set were selected,
i.e., those with the most papers. (ii) For each author name in the top 40, all papers
in the full document collection with any name sharing the same first initial and last
name were retrieved. (iii) Next, the centroids for each author name as in Section
7.4.3 were computed. (iv) The combined similarity scores using the formula in (7.7)
were calculated for all papers of author names sharing the same first initial and
last name. (v) Finally, the resulting scores were compared to manually performed
checks to see which matches are correct.

According to the above criteria, there are a total of 15 pairs of names to 
disambiguate. Table 7.9 shows all the pairs and whether or not each is a correct 
match, which was determined manually. 
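
Step (ii) amounts to grouping names by first initial and last name and then enumerating pairs within each group. The sketch below assumes a simple "initials then last name" format and uses the names that appear in Table 7.9; the helpers `name_key` and `candidate_pairs` are hypothetical and not part of the original experiment.

```python
from collections import defaultdict
from itertools import combinations

def name_key(name):
    """Return (first initial, last name), e.g. 'T. F. Chan' -> ('T', 'CHAN')."""
    parts = name.replace(".", " ").split()
    return parts[0][0].upper(), parts[-1].upper()

def candidate_pairs(names):
    """All unordered pairs of distinct names sharing first initial and last name."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    pairs = []
    for group in groups.values():
        pairs.extend(combinations(group, 2))
    return pairs

# Names as listed in Table 7.9.
names = [
    "T. Chan", "T. F. Chan", "T. M. Chan",
    "T. Manteuffel", "T. A. Manteuffel",
    "S. McCormick", "S. F. McCormick",
    "G. Golub", "G. H. Golub",
    "X. L. Zhou", "X. Y. Zhou",
    "R. Ewing", "R. E. Ewing",
    "S. Kim", "S. C. Kim", "S. D. Kim", "S. J. Kim",
    "J. Shen", "J. H. Shen",
]

pairs = candidate_pairs(names)
```

With these 19 names the grouping yields 15 candidate pairs, in the same order as the rows of Table 7.9.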

Figure 7.3 presents plots of the similarity scores for these 15 pairs of author 
names using CP decompositions with R = 15, 20, 25, 30. The scores denoted by 
+ in the figure are those name pairs that refer to the same person, whereas those 
pairs denoted by o refer to different people. Ideally, there will be a distinct cutoff 
between correct and incorrect matches. The figure shows that, in general, most 
correct matches have higher scores than the incorrect ones. However, there are 
several instances where there is not a clear separation between pairs in the two 
Table 7.9. Author disambiguation.

Author name pairs to be disambiguated.

Pair  Name 1         Name 2            Same Person?
1     T. Chan        T. F. Chan        Yes
2     T. Chan        T. M. Chan        No
3     T. F. Chan     T. M. Chan        No
4     T. Manteuffel  T. A. Manteuffel  Yes
5     S. McCormick   S. F. McCormick   Yes
6     G. Golub       G. H. Golub       Yes
7     X. L. Zhou     X. Y. Zhou        No
8     R. Ewing       R. E. Ewing       Yes
9     S. Kim         S. C. Kim         No
10    S. Kim         S. D. Kim         Yes
11    S. Kim         S. J. Kim         No
12    S. C. Kim      S. D. Kim         No
13    S. C. Kim      S. J. Kim         No
14    S. D. Kim      S. J. Kim         No
15    J. Shen        J. H. Shen        Yes


Figure 7.3. Disambiguation scores.

Author disambiguation scores for various CP tensor decompositions
(+ = correct; o = incorrect). Panels: (a) R = 15, (b) R = 20, (c) R = 25,
(d) R = 30. Horizontal axis: disambiguation pair (1 to 15).
Table 7.10. Disambiguation before and after.

Authors with most papers before and after disambiguation.

Before Disambiguation          After Disambiguation
Papers  Author                 Papers  Author
17      Q. Du                  17      Q. Du
15      K. Kunisch             16      T. F. Chan
15      U. Zwick               16      T. A. Manteuffel
14      T. F. Chan             16      S. F. McCormick
13      A. Klar                15      K. Kunisch
13      T. A. Manteuffel       15      U. Zwick
13      S. F. McCormick        13      A. Klar
13      R. Motwani             13      R. Motwani
12      G. H. Golub            13      G. H. Golub
12      M. Y. Kao              12      M. Y. Kao
12      S. Muthukrishnan       12      S. Muthukrishnan
12      D. Peleg               12      D. Peleg
11      H. Ammari              12      S. D. Kim
11      N. J. Higham           11      H. Ammari
11      K. Ito                 11      N. J. Higham
11      H. Kaplan              11      K. Ito
11      L. Q. Qi               11      H. Kaplan
11      A. Srinivasan          11      L. Q. Qi
11      X. Y. Zhou             11      A. Srinivasan
10      N. Alon                11      X. Y. Zhou

sets — e.g., pairs 8, 13, and 15 in Figure 7.3(a). The CP decomposition with R = 20 
clearly separates the correct and incorrect matches. Future work in this area will 
focus on determining if there is an optimal value for R for the task of predicting 
cutoff values for separating correct and incorrect matches. 

Table 7.10 shows how correctly disambiguating authors can make a difference 
in publication counts. The left column shows the top 20 authors before disambigua- 
tion, and the right column shows the result afterward. There are several author 
names — T. F. Chan, T. A. Manteuffel, S. F. McCormick, G. H. Golub, 
and S. D. Kim — that move up (some significantly) in the list when the ambiguous 
names are resolved correctly. 

One complication that has not yet been addressed is that two different people 
may be associated with the same author name. This is particularly likely in the 
case that the name has only a single initial and a common last name. Consider the 
name Z. Wu — there are two papers in the collection with this author name and five 
others with author names with the same first initial and a different second initial. 
Table 7.11 lists the papers by these authors along with the full first name of the 
author, which was determined by manual inspection. 

Two approaches for solving this name resolution problem are considered:
(i) treating Z. Wu as a single author and taking the centroid of the two papers, and
(ii) treating each paper as belonging to a separate author. In Table 7.12(a), Z. Wu,
as the author of two papers, appears most similar to author 3. Separating the articles
of Z. Wu and recomputing the scores provides much stronger evidence that authors 1b
and 3 are the same author, and that author 1a is most likely not an alias for one of
the other authors; see Table 7.12(b).
Table 7.11. Data used in disambiguating the author Z. Wu.

ID  Author          Title(s)
1a  Wu Z (Zhen)     Fully coupled forward-backward stochastic differential equations and . . .
1b  Wu Z (Zili)     Sufficient conditions for error bounds
2   Wu ZJ (Zhijun)  A fast Newton algorithm for entropy maximization in phase determination
3   Wu ZL (Zili)    First order and second order conditions for error bounds
3   Wu ZL (Zili)    Weak sharp solutions of variational inequalities in Hilbert spaces
4   Wu ZN (Zi-Niu)  Steady and unsteady shock waves on overlapping grids
4   Wu ZN (Zi-Niu)  Efficient parallel algorithms for parabolic problems

Table 7.12. Disambiguation of author Z. Wu.

(a) Combination of all ambiguous authors.

       1     2     3     4
1   1.00  0.18  0.79  0.03
2         1.00  0.06  0.06
3               1.00  0.01
4                     1.00

(b) Separation of all ambiguous authors.

      1a    1b     2     3     4
1a  1.00  0.01  0.21  0.03  0.07
1b        1.00  0.09  0.90  0.00
2               1.00  0.06  0.06
3                     1.00  0.01
4                           1.00

Manual inspection of all the articles by this group of authors indicates that
authors 1b and 3 are in fact the same person, Zili Wu, and that author 1a is not
an alias of any other author in this group. The verified full name of each author is
listed in parentheses in Table 7.11.
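
The effect of merging two unrelated papers into one centroid can be seen in a toy example. The two-dimensional vectors below are hypothetical and chosen only to mimic the qualitative pattern in Table 7.12; they are not the actual feature vectors, and the helper names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Hypothetical feature vectors: paper 1a and paper 1b lie in different
# directions, and author 3's papers lie close to paper 1b.
paper_1a = [1.0, 0.0]
paper_1b = [0.0, 1.0]
author_3 = centroid([[0.0, 1.0], [0.2, 1.0]])

combined = cosine(centroid([paper_1a, paper_1b]), author_3)  # one merged "Z. Wu"
separate_1a = cosine(paper_1a, author_3)                     # paper 1a on its own
separate_1b = cosine(paper_1b, author_3)                     # paper 1b on its own
```

In this toy setting the merged centroid gives a moderate score (about 0.77), while the separated papers give a near-1 score for 1b and a near-0 score for 1a, which is the same direction of change seen between Tables 7.12(a) and 7.12(b).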

The experiments and results presented in this section suggest several ways 
that tensor decompositions can be used for resolving ambiguity in author names. 
In particular, the use of centroids for characterizing a body of work associated with 
an author shows promise for solving this problem. In the next set of experiments, 
though, it can be observed that the utility of centroids may be limited to small, co- 
hesive collections, as they fail to produce useful results for the problem of predicting 
which journal an article may appear in. 

7.4.5 Journal prediction via ensembles of tree classifiers 

Another analysis approach, supervised machine learning with the feature vectors 
obtained in Section 7.4.2, may be used to predict the journal that a given paper is 
published in. 
Table 7.13. Summary journal prediction results.

ID  Journal Name             Size  Correct  Mislabeled as
1   SIAM J APPL DYN SYST      1%     0%      2 (44%)
2   SIAM J APPL MATH         11%    58%      6 (10%)
3   SIAM J COMPUT            11%    56%     11 (20%)
4   SIAM J CONTROL OPTIM     11%    60%      2 (10%)
5   SIAM J DISCRETE MATH      5%    15%      3 (47%)
6   SIAM J MATH ANAL          8%    26%      2 (29%)
7   SIAM J MATRIX ANAL APPL   8%    56%     10 (19%)
8   SIAM J NUMER ANAL        12%    50%     10 (16%)
9   SIAM J OPTIM              7%    66%      4 (16%)
10  SIAM J SCI COMPUT        13%    36%      8 (21%)
11  SIAM PROC                 9%    32%      3 (38%)
12  SIAM REV                  3%     5%      2 (34%)

The approach from Section 7.4.3 of considering the centroid of a body of
work does not yield useful results in the case of journals because the centroids are
not sufficiently distinct. Therefore, classifiers trained on subsets of the data are
used to predict the journals in which the articles not included in those training
sets are published. The feature vectors were based on the matrix A from a CP
decomposition with R = 30 components. Thus, each document is represented by
a length-30 feature vector, and the journal in which it is published is used as the
label value, i.e., the classification. The 5022 labeled feature vectors were split into
ten disjoint partitions, stratified so that the relative proportion of each journal's
papers remained constant across the partitions. Ten-fold cross validation was used,
meaning that each one of the ten partitions (10% of the data) was used once as
testing data and the remaining nine partitions (90% of the data) were used to train
the classifier. This computation was done using OpenDT [Banfield et al. 2004] to
create bagged ensembles [Dietterich 2000] of C4.5 decision trees. The ensemble size
was 100; larger ensembles did not improve performance.
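
The stratified ten-fold split can be sketched as follows. This covers only the partitioning step, not the OpenDT training itself; the function names, the round-robin assignment, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k disjoint folds, class by class,
    so each fold keeps roughly the same class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def train_test_splits(folds):
    """Yield (train, test) index lists, using each fold once as test data."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical labels: three "journals" with 60, 30, and 10 papers.
labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))
```

Each of the ten folds here holds 10% of the samples with the 6:3:1 class ratio preserved, and each split trains on the remaining 90%.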

Table 7.13 provides an overview of the results giving, for each journal, its 
identification number, its size relative to the entire collection, the percentage of 
its articles that were correctly classified, and the journal that it was most often 
mislabeled as and how often that occurred. For instance, articles in journal 2
make up 11% of the total collection, are correctly identified 58% of the time, and
are confused with journal 6 most often (10% of the time). The overall "confusion 
matrix" is given in Table 7.14; this matrix is obtained by combining the confusion 
matrices generated for each of the ten folds. 
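
As a consistency check, the summary columns of Table 7.13 can be recomputed from the counts in Table 7.14. The sketch below transcribes those counts (rows taken as actual journals, columns as predicted journals); the function name `summarize` and the use of rounding to whole percentages are assumptions.

```python
# Counts transcribed from Table 7.14 (row = actual journal, column = predicted).
CONFUSION = [
    [0, 14, 4, 1, 1, 4, 0, 3, 1, 1, 2, 1],
    [1, 318, 19, 46, 3, 54, 13, 31, 7, 41, 12, 3],
    [0, 29, 303, 24, 29, 5, 15, 8, 7, 10, 109, 1],
    [0, 57, 21, 346, 2, 34, 20, 12, 51, 22, 11, 1],
    [0, 12, 122, 9, 40, 4, 15, 2, 1, 2, 53, 0],
    [0, 120, 19, 56, 1, 108, 15, 58, 3, 34, 5, 1],
    [0, 23, 11, 22, 5, 8, 235, 18, 18, 81, 2, 0],
    [0, 56, 13, 47, 0, 37, 37, 304, 13, 98, 5, 1],
    [0, 10, 19, 55, 1, 4, 10, 5, 228, 1, 10, 1],
    [0, 77, 7, 32, 0, 36, 98, 135, 23, 237, 7, 4],
    [0, 37, 176, 21, 34, 12, 9, 8, 7, 13, 149, 3],
    [1, 48, 13, 12, 2, 13, 16, 6, 6, 10, 8, 7],
]

def summarize(confusion):
    """Per-journal size, accuracy, and most frequent mislabel (whole percents)."""
    total = sum(sum(row) for row in confusion)
    rows = []
    for i, row in enumerate(confusion):
        n = sum(row)
        # Largest off-diagonal entry in this row: the most common wrong label.
        worst_count, worst_j = max(
            ((count, j) for j, count in enumerate(row) if j != i),
            key=lambda pair: pair[0],
        )
        rows.append({
            "id": i + 1,
            "size_pct": round(100 * n / total),
            "correct_pct": round(100 * row[i] / n),
            "mislabeled_as": worst_j + 1,
            "mislabeled_pct": round(100 * worst_count / n),
        })
    return total, rows

total, rows = summarize(CONFUSION)
```

The counts sum to the 5022 labeled articles, and for journal 2 the function returns a size of 11%, 58% correct, and journal 6 as the most frequent mislabel at 10%, matching the example in the text.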

Figure 7.4 shows a graphical representation of the confusion matrix. Each
journal is represented as a node, and the size of the node corresponds to the
percentage of its articles that were correctly labeled (0-66%). There is a directed edge
from journal i to journal j if journal i's articles were mislabeled as journal j. A
Barnes-Hut force-directed method (using the weighted edges) was used to determine
the positions of the nodes [Beyer 2007]. Only those edges corresponding to
mislabeling percentages of 5% or higher are actually shown in the image (though
all were used for the layout); the thicker the edge, the greater the proportion of
mislabeled articles.
Table 7.14. Predictions of publication.

Confusion matrix of predictions of publication of articles in the different
SIAM publications. A classifier based on bagging and using decision
trees as weak learners was used in this experiment. Rows are the actual
journals; the diagonal entries are correct predictions.

                          Predicted Journal
        1    2    3    4    5    6    7    8    9   10   11   12
 1      0   14    4    1    1    4    0    3    1    1    2    1
 2      1  318   19   46    3   54   13   31    7   41   12    3
 3      0   29  303   24   29    5   15    8    7   10  109    1
 4      0   57   21  346    2   34   20   12   51   22   11    1
 5      0   12  122    9   40    4   15    2    1    2   53    0
 6      0  120   19   56    1  108   15   58    3   34    5    1
 7      0   23   11   22    5    8  235   18   18   81    2    0
 8      0   56   13   47    0   37   37  304   13   98    5    1
 9      0   10   19   55    1    4   10    5  228    1   10    1
10      0   77    7   32    0   36   98  135   23  237    7    4
11      0   37  176   21   34   12    9    8    7   13  149    3
12      1   48   13   12    2   13   16    6    6   10    8    7


Figure 7.4. Journals linked by mislabeling. 



The automatic layout generated by the Barnes-Hut algorithm visually yields 
four clusters, and the nodes in Figure 7.4 are color-coded according to their cluster 
labels. These journals along with their descriptions are presented in Table 7.15, 
and they are clearly clustered by overlap in topics. Observe that, for example,
Table 7.15. Journal clusters.

Journals grouped by how they are generally confused, with descriptions.

ID  Topic

Red-Colored Nodes: Dynamical Systems

2   SIAM J APPL MATH: scientific problems using methods that are of mathematical
    interest such as asymptotic methods, bifurcation theory, dynamical systems theory, and
    probabilistic and statistical methods
6   SIAM J MATH ANAL: partial differential equations, the calculus of variations,
    functional analysis, approximation theory, harmonic or wavelet analysis, or dynamical
    systems; applications to natural phenomena
1   SIAM J APPL DYN SYST: mathematical analysis and modeling of dynamical systems
    and its application to the physical, engineering, life, and social sciences
12  SIAM REV: articles of broad interest

Green-Colored Nodes: Optimization

4   SIAM J CONTROL OPTIM: mathematics and applications of control theory and on
    those parts of optimization theory concerned with the dynamical systems
9   SIAM J OPTIM: theory and practice of optimization

Purple-Colored Nodes: Discrete Math and Computer Science

3   SIAM J COMPUT: mathematical and formal aspects of computer science
    and nonnumerical computing
5   SIAM J DISCRETE MATH: combinatorics and graph theory, discrete optimization and
    operations research, theoretical computer science, and coding and communication theory
11  SIAM PROC: Conference proceedings including SIAM Data Mining, ACM-SIAM
    Symposium on Discrete Algorithms, Conference on Numerical Aspects of Wave
    Propagation, etc.

Cyan-Colored Nodes: Numerical Analysis

7   SIAM J MATRIX ANAL APPL: matrix analysis and its applications
8   SIAM J NUMER ANAL: development and analysis of numerical methods including
    convergence of algorithms, their accuracy, their stability, and their computational
    complexity
10  SIAM J SCI COMPUT: numerical methods and techniques for scientific computation



the scope of SIAM J COMPUT (3) includes everything in the scope of SIAM J 
DISCRETE MATH (5), so it is not surprising that many of the latter's articles are
misidentified as the former. In cases where there is little overlap in the stated scope, 
there seems to be less confusion. For instance, articles from the SIAM J OPTIM 
(9) are correctly labeled 66% of the time and the only other journal it is confused 
with more than 5% of the time is the other optimization journal represented in the 
collection: SIAM J CONTROL OPTIM (4). Note that the SIAM J CONTROL 
OPTIM (4) does include dynamical systems in its description and is, in fact, linked 
to the "dynamical systems" cluster. 



7.5 Related work 

7.5.1 Analysis of publication data 

Researchers look at publication data to understand the impact of individual authors 
and who is collaborating with whom, to understand the type of information being 
published and by which venues, and to extract "hot topics" and understand trends
[Boyack 2004]. 

As an example of the interest in this problem, the 2003 KDD Cup challenge 
brought together 57 research teams from around the world to focus on the analysis 
of publication data for citation prediction (i.e., implicit link detection in a citation 
graph), citation graph creation, and usage estimation (downloads from a server of 
preprint articles) [Gehrke et al. 2003]. The data were from the high-energy physics 
community (a portion of the arXiv preprint server collection*). For this challenge, 
McGovern et al. looked at a number of questions related to the analysis of publica- 
tion data [McGovern et al. 2003]. Of particular relevance to this paper, they found 
that clustering papers based only on text similarity did not yield useful clusters. 
Instead, they applied spectral-based clustering to a citation graph where the edges 
were weighted by the cosine similarity of the paper abstracts — combining citation 
and text information into one graph. Additionally, for predicting in which jour- 
nal an article will be published, they used relational probability trees (see Section 
7.5.3). 

In other work, Barabasi et al. [Barabasi et al. 2002] considered the social net- 
work of scientific collaborations based on publication data, particularly the prop- 
erties of the entire network and its evolution over time. In their case, the data 
were from publications in mathematics and neuroscience. The nodes correspond to 
authors and the links to coauthorship. 

Hill and Provost [Hill & Provost 2003] used only citation information to predict
authorship with an accuracy of 45%. They created a profile on each author based 
on his/her citation history (weighting older citations less). This profile can then 
be used to predict the authorship of a paper where only the citation information is 
known but not the authors. They did not use any text-based matching but observed
that using such methods may improve accuracy. 

Jo, Lagoze, and Giles [Jo et al. 2007] used citation graphs to determine topics 
in a large-scale document collection. For each term, the documents (nodes in the 
citation graph) were down-selected to those containing a particular term. The 
interconnectivity of those nodes within the "term" subgraph was used to determine 
whether or not it comprises a topic. The intuition of their approach was that, 
if a term represents a topic, the documents containing that term will be highly 
interconnected; otherwise, the links should be random. They applied their method 
to citation data from the arXiv (papers in physics) and Citeseer† (papers in computer
science) preprint server collections.

7.5.2 Higher order analysis in data mining 

Tensor decompositions such as CP [Carroll & Chang 1970, Harshman 1970] and
Tucker [Tucker 1966] (including HO-SVD [De Lathauwer et al. 2000] as a special
case) have been in use for several decades in psychometrics and chemometrics and
have recently become popular in signal processing, numerical analysis, neuroscience,



* http://www.arXiv.org/.
† http://citeseer.ist.psu.edu/.
computer vision, and data mining. See [Kolda & Bader 2009] for a comprehensive
survey.

Recently, tensor decompositions have been applied to data-centric problems
including analysis of click-through data [Sun et al. 2005] and chatroom analysis
[Acar et al. 2005, Acar et al. 2006]. Liu et al. [Liu et al. 2005] presented a tensor
space model which outperforms the classical vector space model for the problem
of classification of Internet newsgroups. In the area of web hyperlink analysis, the
CP decomposition has been used to extend the well-known HITS method to incorporate
anchor text information [Kolda et al. 2005, Kolda & Bader 2006]. Bader et
al. [Bader et al. 2007a, Bader et al. 2007b] used tensors to analyze the communications
in the Enron e-mail data set. Sun et al. [Sun et al. 2006a, Sun et al. 2006b]
dynamically updated Tucker models for detecting anomalies in network data. Tensors
have also been used for multiway clustering, a method for clustering entities of
different types based on both entity attributes as well as the connections between
the different types of entities [Banerjee et al. 2007].

7.5.3 Other related work 

Cohn and Hofmann [Cohn & Hofmann 2001] developed a joint probability model
that combines text and links, with an application to categorizing web pages.
Relational probability trees (RPTs) [Getoor et al. 2003, Getoor & Diehl 2005] offer a
technique for analyzing graphs with different link and node types, with the goal of
predicting node or link attributes.

For the problem of author disambiguation, addressed in this paper, Bekkerman
and McCallum [Bekkerman & McCallum 2005] have developed an approach called
multiway distributional clustering (MDC) that clusters data of several types (e.g.,
documents, words, and authors) based on interactions between the types. They used
an instance of this method for disambiguation of individuals appearing in pages on
the web.

7.6 Conclusions and future work 

Multiple similarities between documents in a collection are represented as a three-way
tensor (N × N × K), the tensor is decomposed using the CP-ALS algorithm,
and relationships between the documents are analyzed using the CP component 
matrices. How to best choose the weights of the entries of the tensor is an open 
topic of research — the ones used here were chosen heuristically. 

Different factors from the CP decomposition are shown to emphasize different 
link types; see Section 7.4.1. Moreover, the highest-scoring components in each 
factor denote an interrelated community. The component matrices (A and B) of 
the CP decomposition can be used to derive feature vectors for latent similarity 
scores. However, the number of components (R) of the CP decomposition can
strongly influence the quality of the matches; see Section 7.4.2. The choice of the
number of components (R) and exactly how to use the component matrices are open
questions, including how to combine these matrices, how to weight or normalize the
features, and whether or not to incorporate the factor weightings, i.e., λ.
This brings us to two disadvantages of the CP model. First, the factor matrices
are not orthogonal, in contrast to the matrix SVD. A possible remedy for this is
to instead consider the Tucker decomposition [Tucker 1966], which produces orthog- 
onal component matrices and, moreover, can have a different number of columns 
for each component matrix; unfortunately, the Tucker decomposition is not unique 
and does not produce rank-one components like CP. Second, the best decomposition 
with R components is not the same as the first R factors of the optimal decompo- 
sition with S > R components, again in contrast to the SVD [Kolda 2001]. This 
means that we cannot determine the optimal R by trial-and-error without great 
expense. 

The centroids of feature vectors from the component matrices of the CP de- 
composition can be used to represent a small body of work (e.g., all the papers with 
the phrase "GMRES") in order to find related works. As expected, the feature vec- 
tors from the different component matrices produce noticeably different answers, 
either one of which may be more or less useful in different contexts; see Section 
7.4.3. Combining these scores can be used to provide a ranked list of relevant work, 
taking into account the most relevant items from each of the component matrices. 

A promising application of the similarity analysis is author disambiguation, 
where centroids are compared to predict which authors with similar names are 
actually the same. The technique is applied to the subset of authors with the most 
papers authored in the entire data set and affects the counts for the most published 
authors; see Section 7.4.4. In future work, we will consider the appropriate choice 
of the number of components (R) for disambiguation, identify how to choose the 
disambiguation similarity threshold, and perform a comparison to other approaches. 

Using the feature vectors, it is possible to predict which journal each article 
was published in; see Section 7.4.5. Though the accuracy was relatively low, closer 
inspection of the data yielded clues as to why. For example, two of the publications 
were not focused publications. Overall, the results revealed similarities between the 
different journals. In future work, we will compare the results of using ensembles 
of decision trees to other learning methods (e.g., k-nearest neighbors, perceptrons,
and random forests). 

We also plan to revisit the representation of the data on two fronts. First, we 
wish to add authors as nodes. Hendrickson [Hendrickson 2007] observes that
term-by-document matrices can be expanded to be (term plus document)-by-(term plus
document) matrices so that term-term and document-document connections can be 
additionally encoded. Therefore, we intend to use a (document plus author) dimen- 
sion so that we can explicitly capture connections between documents and authors 
as well as the implicit connections between authors, such as colleagues, conference 
co-organizers, etc. Second, in order to make predictions or analyze trends over time, 
we intend to incorporate temporal information using an additional dimension for 
time. 

Though the CP decomposition has indications of the importance of each link in 
the communities it identifies (see Section 7.4.1), we do not exploit this information 
in reporting or computing similarities. As noted in [Ramakrishnan et al. 2005],
understanding how two entities are related is an important issue and a topic for 
future work. 
The reasons that the spectral properties of adjacency matrices aid in clustering
are beginning to be better understood; see, e.g., [Brand & Huang 2003]. Similar
analyses to explain the utility of the CP model for higher order data are needed. 
Subgraph Detection



Jeremy Kepner* 



Abstract 

Detecting subgraphs of interest in larger graphs is the goal of many 
graph analysis techniques. The basis of detection theory is computing 
the probability of a "foreground" with respect to a model of the "back- 
ground" data. Hidden Markov Models represent one possible foreground 
model for patterns of interaction in a graph. Likewise, Kronecker graphs 
are one possible model for power law background graphs. Combining 
these models allows estimates of the signal to noise ratio, probability of 
detection, and probability of false alarm for different classes of vertices 
in the foreground. These estimates can then be used to construct fil- 
ters for computing the probability that a background graph contains a 
particular foreground graph. This approach is applied to the problem 
of detecting a partially labeled tree graph in a power law background 
graph. One feature of this method is the ability to a priori estimate the 
number of vertices that will be detected via the filter. 

8.1 Graph model 

Graphs come in many shapes and sizes: directed, undirected, multi-edge, and hy- 
pergraphs, to name just a few. In addition, graphs are far more than just a set of 
connections. They also contain metadata about the vertices and edges. The first
step in developing detection theory for graphs is to define a high dimensional model
of a graph that can handle this diversity of information. If there is an edge from vertex
i to vertex j at time t of type l, then let the N vertex graph adjacency tensor have

$$A(i, j, t) = l$$

where $A : \mathbb{Z}^{N \times N \times N_t}$. Likewise, if there is no edge, then A(i, j, t) = 0.
Other convenient forms of the same notation include

$$A(i, j, k(t), l) = 1$$

where $A : \mathbb{B}^{N \times N \times N_k \times N_l}$ and k(t) represents the binning of t into discrete time
windows. If $e_{ijk}$ is a pointer to an edge, then another form is

$$A(i, j, k) = e_{ijk}$$

where $A : \mathbb{Z}^{N \times N \times N_k}$. In this form, each edge can also have metadata in the form of
an arbitrary number of additional parameters such as $t(e_{ijk})$ and $l(e_{ijk})$. Likewise,
vertices can also have similar parameters $t(v_i)$ and $l(v_i)$. Finally, the number of
edges in the graph is computed by counting the number of nonzero entries in A:

$$M = |A| = \sum_{i,j,k,l} \{A(i, j, k, l) \neq 0\}$$
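
One way to make the Boolean form A(i, j, k(t), l) = 1 concrete is to store only the nonzero entries. The sketch below is an illustration under stated assumptions: the fixed-width time windows, the edge list, the type labels, and the helper names `time_bin` and `build_tensor` are all hypothetical.

```python
def time_bin(t, window):
    """k(t): index of the fixed-width time window containing t (an assumed binning)."""
    return int(t // window)

def build_tensor(edges, window):
    """Store the Boolean tensor sparsely as the set of index tuples
    (i, j, k(t), l) whose entry equals 1; all other entries are 0."""
    return {(i, j, time_bin(t, window), l) for (i, j, t, l) in edges}

# Hypothetical edges: (source vertex, target vertex, time, edge type).
edges = [
    (0, 1, 0.5, "cites"),
    (0, 1, 0.7, "cites"),      # same vertices, window, and type as the edge above
    (1, 2, 3.2, "cites"),
    (2, 0, 3.9, "coauthor"),
]

A = build_tensor(edges, window=1.0)
M = len(A)  # number of nonzero entries, M = |A|
```

Because the Boolean form records only whether an entry is nonzero, the two edges that fall in the same time window with the same type collapse into one entry; keeping them distinct would call for the edge-pointer form instead.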

8.1.1 Vertex/edge schema

The above multidimensional adjacency matrix description of a graph allows a rich 
variety of data to be directly incorporated into the graph model. However, the 
model says nothing about what is a vertex and an edge. In some instances, the 
vertices and edges are obvious. For example, in a webgraph (i.e., a graph of links 
between web pages) the vertices are webpages and A(i, j) = 1 implies that webpage
i has a link pointing to webpage j. Similarly, in a citation graph, A(i, j) = 1 implies 
that paper i cites paper j. In both of these graphs, all the vertices are of the same 
type (i.e., all webpages or all documents) and that type happens to be the same type 
as the items in the corpus (webpages or documents). Such graphs can be referred 
to as having monotype vertices. 

Another common type of graph associates entities in the items of the corpus
with those items. For example, in a graph of all the words in a corpus
of webpages or documents, A(i, j) = 1 might imply that
webpage/paper i contains word j. Such graphs can be referred to as having dual-type
vertices. These dual-type vertex graphs can be converted to monotype vertex
graphs by multiplying the adjacency matrix by its own transpose. A document × word
adjacency matrix can be converted into a word × word adjacency matrix via the inner
square product

$$W = A^T A$$
Figure 8.1. Multitype vertex/edge schema.

Edges point to documents. All pairs of items in a document create an
edge. Labels shown in the figure: people, words, author, dept, institution,
title, abstract, text.

where W(i, j) contains the number of documents that contain both words i and j.
Likewise, a document × document matrix can be constructed via the outer square
product

$$D = A A^T$$

where D(i, j) contains the number of words that are in both documents i and j.
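
A small worked example makes the two products concrete. The 3 × 4 document-by-word matrix below is hypothetical, and the dense list-of-lists helpers are for illustration only; a sparse matrix library would be the natural choice at scale.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Dense matrix product of x (n x p) and y (p x m)."""
    y_cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_cols] for row in x]

# Hypothetical incidence matrix: A[i][j] = 1 if document i contains word j.
A = [
    [1, 1, 0, 0],  # document 0 contains words 0 and 1
    [1, 0, 1, 0],  # document 1 contains words 0 and 2
    [1, 1, 0, 1],  # document 2 contains words 0, 1, and 3
]

W = matmul(transpose(A), A)  # word x word: documents shared by each word pair
D = matmul(A, transpose(A))  # document x document: words shared by each document pair
```

Here W[0][1] is 2 because words 0 and 1 appear together in documents 0 and 2, and D[0][2] is 2 because documents 0 and 2 share words 0 and 1. The diagonals count the documents per word and the words per document.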

The utility of a graph for answering particular questions is usually dictated 
by the vertex/edge schema. The more specific the schema, the better it will be at 
addressing a specific question. When the questions to be answered are still being 
formulated, it may not be apparent how to organize the graph or even what is a 
vertex or an edge. In this situation, a multitype vertex schema is often useful as it 
can be applied to a range of data. In a multitype vertex graph, the items in the
corpus are the edges and the vertices are of many types: documents, people, words,
etc. Each document generates a clique of edges all pointing to that document.
In a multitype vertex graph, all interrelationships between all entities are stored 
(see Figure 8.1). Clearly, the size of the multitype vertex graph can be significantly 
larger than a monotype or dual-type vertex graph. Once the data is understood and 
the kinds of questions to be answered are defined, then the graph can be simplified 
to the appropriate monotype vertex or dual-type vertex graph. 
8.2 Foreground: Hidden Markov model

The first component of a detection theory is a mathematical model for the object
that is being sought, in this case a subgraph embedded in a larger graph.
There are many ways to formulate such a model. Here a hidden Markov model
(HMM) is used to model the underlying evolving process that generates the subgraph
[Weinstein et al. 2009]. It may be the case that more is known about these
underlying processes (e.g., how a vertex is added to a subgraph) than about the
specific subgraphs themselves.

An HMM describes the evolution of a system between states s. The probability
of transitioning from state s to s' is given by the HMM transition probability matrix
$S_h(s, s')$. Within each s, a number of possible edges $e_{ijl}$ that are characteristic of
that state could be observed in the graph. The probability of observing an edge $e_{ijl}$
in state s is given by the observation matrix $B_h(s, e_{ijl})$. For a given state h(s), this
can be recoded as a probability graph

$$A_{h(s)}(i, j, l) = B_h(s, e_{ijl})$$

If the states are sequential (i.e., no revisiting of earlier states), then the probability
graph for each state is stationary (i.e., does not change over time). Thus, s can be
substituted for t and the probability graph for the entire HMM is

$$A_h(i, j, s, l) = B_h(s, e_{ijl})$$

8.2.1 Path moments 

A model of the underlying process used for generating a subgraph is only practical 
if there is some way to connect the model to real observations (e.g., an observed 
path in a graph). This connection can be achieved by computing the a priori 
probability of a particular path from the model and then comparing it with the 
observations. Likewise, the observations of particular paths can be used to compute 
the parameters in the model. 

Consider any path of length $r - 1$ consisting of vertices $v_1 v_2 v_3 \ldots v_r$ and edges
$e_{12} e_{23} \ldots e_{(r-1)r}$, where the vertices are not necessarily unique. Such a path will
generate a specific sequence of vertex and edge parameters (or types):

$$l(v_1)\, l(e_{12})\, l(v_2)\, l(e_{23})\, l(v_3) \cdots l(v_{r-1})\, l(e_{(r-1)r})\, l(v_r)$$

or

$$l^v_1\, l^e_{12}\, l^v_2\, l^e_{23}\, l^v_3 \cdots l^v_{r-1}\, l^e_{(r-1)r}\, l^v_r$$

The distribution of all paths < r is denoted by $P_r$, a function of the type sequence

$$P_r(l^v_1, l^e_{12}, l^v_2, l^e_{23}, l^v_3, \ldots, l^v_{r-1}, l^e_{(r-1)r}, l^v_r)$$

Typically, distributions will be limited to smaller values of r. $P(l^v)$ is the
distribution of all vertex types. $P(l^e)$ is the distribution of all edge types. $P(l^v, l^e)$ is the
joint distribution of vertex and edge types and is the probability that a particular
vertex type will have an edge of a particular edge type. $P(l^v_1, l^e_{12}, l^v_2)$ is the joint
distribution of all vertex, edge, vertex types and is the probability that two vertices
of given vertex types will have an edge of a particular edge type.
Special cases

If the number of vertex types is equal to the number of vertices ($N_{l^v} = N$) and there
is only one edge type ($N_{l^e} = 1$), then $P(l^v)$ is simply the degree distribution of
the graph

$$P(l^v) = P(l(v_i)) = |A(i, :, :)| = \sum_{j,k} A(i, j, k)$$

or, equivalently,

$$P(l^v) = P(l(v_j)) = |A(:, j, :)| = \sum_{i,k} A(i, j, k)$$

The joint distribution $P(l^v_1, l^v_2)$ is simply the N × N adjacency matrix of the graph

$$P(l^v_1, l^v_2) = P(l(v_i), l(v_j)) = A(i, j) = |A(i, j, :)| = \sum_{k} A(i, j, k)$$

If the number of vertex types is one ($N_{l^v} = 1$) and the number of edge types
equals the number of edges ($N_{l^e} = M$), then $P(l^e)$ is the edge distribution of the
graph

$$P(r) = |l^e = r| = \sum_{i,j,k} \{A(i, j, k) = r\}$$

If the number of vertex types is equal to the number of vertices ($N_{l^v} = N$) and
the edge types equal the number of edges ($N_{l^e} = M$), then $P(l^v_1, l^v_2, l^e_{12})$ recovers the
original graph

$$P(l^v_1, l^v_2, l^e_{12}) = P(l(v_i), l(v_j), l(e_{ij})) = A(i, j, k)$$

Expected paths 

If the vertex and edge types are uncorrelated, then the joint probability is the
product of the probabilities of vertex and edge types

$$P(l(v_i), l(v_j), l(e_{ij})) = P(l(v_i))\, P(l(v_j))\, P(l(e_{ij}))$$

Comparing the right side with the actual joint counts in the data (e.g., as a ratio) can be used
to determine if any higher order correlations exist within the data. Note: if $A(i, j, k)$
has strong correlations (e.g., in a power law distribution), then the above formula
may need to be modified to take into account Poisson sampling effects (i.e., if one
vertex has many more edges, then this will create "artificial" correlations with that
vertex type).
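
A minimal sketch of this comparison is given below. It assumes the marginal and joint type distributions are estimated by simple counting over a list of typed edges; the edge list and the function name `correlation_ratio` are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical typed edges: (source vertex type, target vertex type, edge type).
edges = [
    ("person", "document", "wrote"),
    ("person", "document", "wrote"),
    ("person", "document", "read"),
    ("person", "person", "knows"),
]

n = len(edges)
count_src = Counter(s for s, _, _ in edges)    # counts for P(l(v_i))
count_dst = Counter(d for _, d, _ in edges)    # counts for P(l(v_j))
count_edge = Counter(e for _, _, e in edges)   # counts for P(l(e_ij))
count_joint = Counter(edges)                   # counts for the joint distribution

def correlation_ratio(triple):
    """Observed joint probability divided by the product of the marginals."""
    s, d, e = triple
    expected = (count_src[s] / n) * (count_dst[d] / n) * (count_edge[e] / n)
    return (count_joint[triple] / n) / expected
```

A ratio near 1 is what the uncorrelated model predicts; ratios far from 1 point to combinations of types that occur more or less often than independence would suggest.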

For a given $A_{h(s)}$, the expected path-type distribution $P_{h(s)}$ can be computed by
integrating over all paths. In fact, this distribution can be used to define the
state $h(s)$. In other words, two HMM states can be viewed as path equivalent if their
expected path type distributions are the same.

Computing $P_r$ from the graph and comparing with all $P_{h(s)}$ allows the
probability that a particular vertex is part of a particular $h(s)$ to be computed.
8.3 Background model: Kronecker graphs

The second component of a detection theory is a mathematical model for the
background. There are many ways to formulate such a model. Here a Kronecker
product approach is used because it naturally generates a wide range of graphs. The
Kronecker graph generation algorithm (see [Leskovec 2005, Chakrabarti 2004]) is
quite elegant and can be described as follows. First, let $A \in \mathbb{R}^{M_B M_C \times N_B N_C}$,
$B \in \mathbb{R}^{M_B \times N_B}$, and $C \in \mathbb{R}^{M_C \times N_C}$. Then the Kronecker product is defined as
follows (see [Van Loan 2000])

$$A = B \otimes C = \begin{pmatrix}
B(1,1)C & B(1,2)C & \cdots & B(1,N_B)C \\
B(2,1)C & B(2,2)C & \cdots & B(2,N_B)C \\
\vdots & \vdots & & \vdots \\
B(M_B,1)C & B(M_B,2)C & \cdots & B(M_B,N_B)C
\end{pmatrix}$$

Now let $A \in \mathbb{R}^{N \times N}$ be an adjacency matrix. The Kronecker exponent to the power
k is as follows

$$A^{\otimes k} = A^{\otimes (k-1)} \otimes A$$

which generates an $N^k \times N^k$ adjacency matrix. This simple model naturally
produces self-similar graphs, yet even a small A matrix provides ample parameters to
fine-tune the generator to a particular real-world data set. The model also lends
itself to the analytic computation of a wide range of graph measures.
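
The definition and the recursive exponent can be sketched in a few lines. This is an illustrative, dependency-free version using lists of rows; the function names `kron` and `kron_power` are chosen here and are not taken from the text.

```python
def kron(b, c):
    """Kronecker product of two matrices given as lists of rows.

    Row (i, k) and column (j, l) of the result hold b[i][j] * c[k][l].
    """
    return [
        [b_ij * c_kl for b_ij in b_row for c_kl in c_row]
        for b_row in b
        for c_row in c
    ]

def kron_power(a, k):
    """k-th Kronecker power of a, built as (k-1)-th power (x) a, for k >= 1."""
    result = a
    for _ in range(k - 1):
        result = kron(result, a)
    return result

A = [[1, 1],
     [0, 1]]
A2 = kron_power(A, 2)   # 4 x 4 adjacency matrix
A3 = kron_power(A, 3)   # 8 x 8 adjacency matrix
```

For a 0/1 matrix the number of nonzeros multiplies at each step, so a generator with 3 nonzeros yields 9 and then 27 edges here.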

The physical intuition as to why Kronecker products naturally generate power
law graphs is as follows. Given a simple single state HMM with probability graph $A_h$,
the product $A_h(i, j) A_h(i', j')$ is the probability of a vertex being reached via the
HMM process $e_{ij} e_{i'j'}$. Assuming that all such edges in a graph are the result of
such a process, then $A_h^{\otimes k}$ contains the probability of all possible processes of length
k. Thus each edge in the resulting adjacency matrix indicates the precise set of
HMM process steps used to generate that edge.



8.4 Example: Tree finding 

The aforementioned detection theory concepts are illustrated by applying them to 
the simple example of finding a small tree embedded in a larger power law graph. 
This problem is related to the minimal spanning tree problem (see, for instance, 
[Fürer & Raghavachari 1992] and the references therein). 



8.4.1 Background: Power law 

Consider a directed Kronecker graph with M edges and $N = 2^k$ vertices with
boolean adjacency matrix $A : \mathbb{B}^{N \times N}$ given by

$$A \leftarrow A_h^{\otimes k}$$

where the generator $A_h : \mathbb{R}^{2 \times 2}$ has values such that the resulting graph is a power
law graph. Typical values of $A_h$ are (see http://www.GraphAnalysis.org/benchmark)

$$A_h = \begin{pmatrix} 0.55 & 0.1 \\ 0.1 & 0.25 \end{pmatrix}$$

The above matrix describes a stationary HMM process whereby the probability of
staying in the same vertex is higher than transitioning to the other vertex.
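
One way to draw a boolean graph from such a generator is sketched below. The text does not spell out the sampling procedure, so this sketch assumes that each edge is drawn independently with probability proportional to the corresponding entry of the k-th Kronecker power, by picking one generator cell per level. The function name `sample_kronecker_edges` and the `seed` parameter are hypothetical.

```python
import random

GENERATOR = [[0.55, 0.10],
             [0.10, 0.25]]

def sample_kronecker_edges(generator, k, num_edges, seed=0):
    """Draw directed edges on 2**k vertices from a 2 x 2 generator.

    Each draw picks k generator cells (one per level); the chosen row bits
    form the source index and the chosen column bits form the target index.
    Duplicate draws collapse, so fewer than num_edges edges may be returned.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(2) for c in range(2)]
    weights = [generator[r][c] for r, c in cells]
    edges = set()
    for _ in range(num_edges):
        i = j = 0
        for r, c in rng.choices(cells, weights=weights, k=k):
            i, j = 2 * i + r, 2 * j + c
        edges.add((i, j))
    return edges

edges = sample_kronecker_edges(GENERATOR, k=4, num_edges=100)
```

Because the (0, 0) cell carries most of the weight, low-numbered vertices collect many more edges than high-numbered ones, which is the source of the skewed degree distribution.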

8.4.2 Foreground: Tree 

Consider a tree consisting of $k_T$ levels and $n_T$ branches per node. An HMM process
for generating such a tree consists of $k_T$ products beginning with the branch process
given by the row vector of ones $1_{1 \times n_T}$. Next, $k_T - 1$ applications of the identity matrix
$I_{n_T \times n_T}$ replicate each branch $n_T$ times. Finally, the multiplication by the column
vector creates the leaf nodes

$$( 1 \;\; 0_{1 \times (n_T - 1)} )'$$

where ' denotes matrix transpose. Using this process, the adjacency matrix
$A_{T \times T} : \mathbb{B}^{N_T \times N_T}$ describing a directed tree with $N_T = n_T^{k_T}$ nodes can be constructed by

$$A_{T \times T} = ( 1 \;\; 0_{1 \times (n_T - 1)} )' \otimes I_{n_T}^{\otimes (k_T - 1)} \otimes 1_{1 \times n_T}$$

The above tree will have $k_T$ levels and $n_T$ branches per node. Each branch node
will have a degree of $n_T + 1$ and each leaf will have a degree of 1. See Figure 8.2
for the case in which $n_T = 2$ and $k_T = 4$.
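
The construction can be checked numerically with a short sketch. It implements the Kronecker expression for the tree adjacency matrix with plain lists; the helper names `kron` and `tree_adjacency` are chosen for this illustration.

```python
def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [b_ij * c_kl for b_ij in b_row for c_kl in c_row]
        for b_row in b
        for c_row in c
    ]

def tree_adjacency(n_t, k_t):
    """(1 0 ... 0)' (x) I^{(x)(k_t - 1)} (x) 1_{1 x n_t} as a list of rows."""
    e1 = [[1]] + [[0] for _ in range(n_t - 1)]                 # column vector
    identity = [[int(i == j) for j in range(n_t)] for i in range(n_t)]
    ones_row = [[1] * n_t]                                     # row vector of ones
    result = e1
    for _ in range(k_t - 1):
        result = kron(result, identity)
    return kron(result, ones_row)

A_tree = tree_adjacency(n_t=2, k_t=4)                          # 16 x 16
# Degree = out-degree + in-degree, i.e. the row sums of A + A'.
degree = [sum(A_tree[i]) + sum(row[i] for row in A_tree) for i in range(16)]
```

In this 0-based sketch, row i of the first half of the matrix points at vertices 2i and 2i + 1. Note that the root therefore points at itself, and that self-loop is counted in its degree.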

8.4.3 Detection problem 

Let T be a set of $N_T$ random vertices drawn from [1, N]. The tree is inserted into the
Kronecker graph via the assignment operation $A(T, T) = A_{T \times T}$. The detection
problem is to find T given A. Given this information, the only means for identifying
T would be topological. The primary topological tool that can be used to find T
is its degree distribution. For a nominal power law graph where $M/N \approx 8$, and
a nominal binary tree ($n_T = 2$) with $M_T/N_T \approx 2$, the degree distribution can be
used to identify vertices that are more likely in T. Unfortunately, if N and M are
large compared to $N_T$ and $M_T$, this problem is extremely challenging (see SNR
calculation in next section) and likely does not have a unique solution because A
will contain many trees that are topologically identical to $A_{T \times T}$.

A common variant of the above problem is the case in which some of the
vertices in the tree are known and some are not. Thus, the problem can be made
more tractable if additional metadata on the vertices or edges are provided that
allow a subset V of T to be identified such as $V = \{v : l(v) = l^v\}$ or
$V = \{v : l(e_{vv'}) = l^e\}$. T is then split into two sets such that $T = V \cup U^*$ (see Figure 8.3),
where $f = N_V / N_T$ and $1 - f = N_{U^*} / N_T$. The problem is then to find $U^*$ given V.
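
The insertion and the split into known and unknown vertices can be sketched on an edge-set representation. This is only an illustration: the function name `insert_subgraph`, the 0-based vertex numbering, the toy background, and the choice of taking the first half of T as the known set V are all assumptions made here.

```python
import random

def insert_subgraph(background_edges, sub_edges, n, n_sub, seed=0):
    """Mimic the assignment A(T, T) = A_TxT on sets of (i, j) edges.

    background_edges uses vertices 0..n-1, sub_edges uses local vertices
    0..n_sub-1. Returns the new edge set and the list T of chosen vertices.
    """
    rng = random.Random(seed)
    t = rng.sample(range(n), n_sub)           # N_T distinct random vertices
    members = set(t)
    # The assignment overwrites the T x T block, so background edges inside it go.
    kept = {(i, j) for (i, j) in background_edges
            if not (i in members and j in members)}
    kept |= {(t[a], t[b]) for (a, b) in sub_edges}
    return kept, t

background = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)}
tree = {(0, 0), (0, 1), (1, 2), (1, 3)}       # the n_T = 2, k_T = 2 tree
graph, T = insert_subgraph(background, tree, n=6, n_sub=4)
V, U_star = set(T[:2]), set(T[2:])            # f = 1/2: half known, half hidden
```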
Figure 8.2. Tree adjacency matrix.

Graphical and algebraic depiction of the adjacency matrix of a tree with
$k_T = 4$ levels and $n_T = 2$ branches per node.

Figure 8.3. Tree sets.

Adjacency matrix of a tree split into two sets $T = V \cup U^*$.

Figure 8.4. Tree vectors.

Adjacency matrices of a tree split into two vectors $t = v + u^*$.



8.4.4 Degree distribution 

The primary topological tool that can be used to find T is the degree distribution. 
The degree of each vertex in the tree can be computed as follows 



$$d_{T \times T} = (A_{T \times T} + A'_{T \times T})\, 1_{N_T \times 1}$$

For a tree, half the vertices will have the max degree $d^{\max} = n_T + 1$, and the other
half will have a degree of 1.

The above type of calculation can be simplified by introducing the following
set/vector notation. Let set T also be represented by the boolean column vector
$t : \mathbb{B}^{N \times 1}$ where $t(T) = 1$ and zero otherwise (see Figure 8.4). Furthermore, let
$|t| = N_T$ denote the number of nonzero elements in t. The advantages of using t
instead of T are that it preserves the overall context of the graph and allows creating
subsets with less bookkeeping. Let $I_t = \mathrm{diag}(t)$ be the diagonal matrix with the
vector t along the diagonal. The subgraph consisting of T can now be written as

$$A_{t \times t} = A_{t^2} = I_t A I_t$$

From $A_{t^2} : \mathbb{B}^{N \times N}$, the vertex degrees can be computed as follows

$$d_{t^2} = (A_{t^2} + A'_{t^2})\, 1_{N \times 1}$$

The nonzero values of $d_{t^2}$ will be the same as the values of $d_{T \times T}$.

Often, it will be necessary to compute the degree of a set of starting vertices
v into a set of ending vertices u. This begins by computing the corresponding
adjacency matrices

$$A_{v \times u} = I_v A I_u$$

and

$$A_{u \times v} = I_u A I_v$$

The resulting degree distribution is then

$$d_{v \times u} = (A_{v \times u} + A'_{u \times v})\, 1_{N \times 1}$$
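
The masked degree computation can be sketched without forming any matrices. The version below works on a set of directed edges and treats v and u as Python sets in place of the boolean vectors; the function name `degree_into` is made up for this example.

```python
def degree_into(edges, v, u, n):
    """Degree of each vertex in v counting only edges to or from u.

    Equivalent in spirit to (A_{v x u} + A_{u x v}') 1: out-edges from v
    into u plus in-edges from u into v, accumulated on the v vertex.
    """
    d = [0] * n
    for i, j in edges:
        if i in v and j in u:    # an entry of A_{v x u}
            d[i] += 1
        if i in u and j in v:    # an entry of A_{u x v}, transposed onto v
            d[j] += 1
    return d

tree_edges = {(0, 0), (0, 1), (1, 2), (1, 3)}   # the n_T = 2, k_T = 2 tree
everything = {0, 1, 2, 3}
d_t2 = degree_into(tree_edges, everything, everything, n=4)
d_root_to_rest = degree_into(tree_edges, {0}, {1, 2, 3}, n=4)
```

With v and u both set to the whole tree, this reduces to the ordinary degree vector, where branch nodes have degree 3 and leaves have degree 1.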

8.5 SNR, PD, and PFA 

An essential aspect of detection is estimating the difficulty of the problem. The 
signal to noise ratio (SNR) is a standard tool for computing this difficulty. In the 
context of subgraph detection, SNR will be defined as follows. Given a set of vertices 
selected containing some number of foreground vertices, the SNR is 

$$\mathrm{SNR}(\{\text{foreground vertices selected}\} \subset \{\text{vertices selected}\}) = \frac{\#\text{foreground vertices selected}}{\#\text{background vertices selected}}$$

At first glance, the overall SNR of any subgraph detection problem can be lower
bounded by selecting all the vertices in the graph

$$\mathrm{SNR}(u^* \subset a) = |u^*| / |a - u^*| \approx |u^*| / N$$

where $|u^*|$ is the number of nonzero elements in $u^*$ and $a = 1_{N \times 1}$ is the vector
representing all the vertices in A. For a nominal power law graph ($N = 2^{20}$) and
tree ($N_T = 2^7$, $f = 1/2$), the $\mathrm{SNR}(u^* \subset a) = 2^{-14}$.

Another important tool for assessing detection is the probability of detection
(PD), which can be defined as

$$\mathrm{PD}(\{\text{foreground items selected}\} \subset \{\text{foreground items}\}) = \frac{\#\text{foreground items selected}}{\#\text{foreground items}}$$

By definition, if all the items in the graph are selected, then PD = 1.

Probability of detection is usually computed in concert with the probability 
of false alarm (PFA), which is defined as 

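$$\mathrm{PFA}(\{\text{background items selected}\} \subset \{\text{items selected}\}) = \frac{\#\text{background items selected}}{\#\text{items selected}}$$

If all the items in the graph are selected, then

$$\mathrm{PFA}(a - u^* \subset a) = |a - u^*| / |a| \approx 1 - \mathrm{SNR}(u^* \subset a) = 1 - 2^{-14}$$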

Combined together, the SNR, PD, and PFA are useful tools for assessing 
the difficulty of a detection problem. For example, problems with a low SNR are 
more difficult than problems with a high SNR. By computing the SNR and then 
identifying ways to increase the SNR, a simple filtering algorithm will emerge for 
the tree detection example problem. 
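
The three definitions translate directly into set operations. The sketch below uses Python sets of vertex identifiers; the function names and the toy sizes (1024 vertices, 8 of them foreground) are chosen only for illustration.

```python
def snr(selected, foreground):
    """#foreground selected divided by #background selected."""
    fg = len(selected & foreground)
    bg = len(selected - foreground)
    return fg / bg if bg else float("inf")

def prob_detection(selected, foreground):
    """#foreground selected divided by #foreground items."""
    return len(selected & foreground) / len(foreground)

def prob_false_alarm(selected, foreground):
    """#background selected divided by #items selected."""
    return len(selected - foreground) / len(selected)

all_vertices = set(range(1024))   # toy graph
u_star = set(range(8))            # toy foreground

snr_all = snr(all_vertices, u_star)
pd_all = prob_detection(all_vertices, u_star)
pfa_all = prob_false_alarm(all_vertices, u_star)
```

Selecting everything gives PD = 1 but an SNR of only 8/1016, which is the lower bound that any useful filter has to improve on.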
8.5.1 First and second neighbors

The first step toward getting a higher SNR for $u^*$ is to use the metadata that
identified $v \subset t$ and to recognize that $u^*$ will most likely be found in the first and
second nearest neighbors of v.

The number of first and second nearest neighbors to v can be approximated
by

$$|v_I| \approx 2(M/N)|v|$$

and

$$|v_{II}| \approx 2(M/N)|v_I| = 4(M/N)^2 |v|$$

where the factors of two come from not differentiating between incoming and
outgoing edges.

Assuming that $u^* \subset v_I + v_{II}$, then the SNR can be computed as

$$\mathrm{SNR}(u^* \subset v_I + v_{II}) = \frac{|u^*|}{|v_I + v_{II} - u^*|} \approx \frac{|u^*|}{|v_{II}|} = \frac{(1-f)|v|/f}{4(M/N)^2 |v|} = \frac{(1-f)/f}{4(M/N)^2}$$

For a nominal power law graph where M/N = 8 and a nominal binary tree ($n_T = 2$)
with f = 0.5, $\mathrm{SNR}(u^* \subset v_I + v_{II}) \approx 2^{-8}$. Likewise, by assumption $\mathrm{PD} \approx 1$ and
$\mathrm{PFA} \approx 1 - \mathrm{SNR}$.

8.5.2 Second neighbors 

The SNR for the tree detection problem can be refined by recognizing that nearly
all vertices in $u^*$ can be categorized into two classes with very different SNRs

$$u^* \approx u^*_I + u^*_{II}$$

where $u^*_I$ are in the set of v's first neighbors $v_I$, and $u^*_{II}$ are in the set of v's second
neighbors $v_{II}$.

Now suppose $|u^*_{II}|$ can be approximated by the number of "fallen leaves" and
"fallen branches" on the tree. A fallen leaf is a leaf node in $u^*$ whose branch node
back to the tree is also in $u^*$. The number of leaf nodes in a tree is approximately
$(1 - n_T^{-1})|t|$ and the number of leaf nodes in $u^*$ is approximately $(1 - f)(1 - n_T^{-1})|t|$.
The probability that a leaf node's branch node is in $u^*$ is $1 - f$; thus, the number of
fallen leaves is $(1 - f)^2 (1 - n_T^{-1})|t|$. A fallen branch occurs when a branch node and
all its neighbors are in $u^*$. The number of branch nodes in the tree is approximately
$|t|/n_T$ with approximately $(1 - f)|t|/n_T$ in $u^*$. The probability that all the neighbors
of a branch are also in $u^*$ is $(1 - f)^{n_T + 1}$; thus, the total number of fallen leaves and
branches is

$$|u^*_{II}| \approx (1 - f)^2 (1 - n_T^{-1})|t| + (1 - f)^{n_T + 2}|t|/n_T = (1 - f)^2 [1 - n_T^{-1} + n_T^{-1}(1 - f)^{n_T}]\,|t|$$

The corresponding SNR is

$$\mathrm{SNR}(u^*_{II} \subset v_{II}) = \frac{|u^*_{II}|}{|v_{II} - u^*_{II}|} \approx \frac{|u^*_{II}|}{|v_{II}|} = \frac{(1 - f)^2 [1 - n_T^{-1} + n_T^{-1}(1 - f)^{n_T}]\,|t|}{4(M/N)^2 |v|} = \frac{(1 - f)^2 [1 - n_T^{-1} + n_T^{-1}(1 - f)^{n_T}]}{4(M/N)^2 f}$$

For a nominal power law graph and tree, the $\mathrm{SNR}(u^*_{II} \subset v_{II}) = 5/2^{12}$. Likewise,
the PD is

$$\mathrm{PD}(u^*_{II} \subset u^*) = |u^*_{II}| / |u^*| \approx 5/2^4$$

Similarly, the PFA is

$$\mathrm{PFA}(v_{II} - u^*_{II} \subset v_{II}) = \frac{|v_{II} - u^*_{II}|}{|v_{II}|} \approx 1 - \mathrm{SNR}(u^*_{II} \subset v_{II}) = 1 - 5/2^{12}$$

Note that $\mathrm{SNR}(u^*_{II} \subset v_{II})$ is much lower than $\mathrm{SNR}(u^* \subset v_I + v_{II})$. Thus,
by not including second neighbors and only targeting first neighbors, it should be
possible to increase the SNR.

8.5.3 First neighbors 

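By restricting the selection to the first neighbors $v_I$, the SNR can be increased

$$\mathrm{SNR}(u^*_I \subset v_I) = \frac{|u^*_I|}{|v_I - u^*_I|} \approx \frac{|u^*_I|}{|v_I|} = \frac{1 - |u^*_{II}|/|u^*|}{|v_I|/|u^*|} = \frac{1 - \mathrm{PD}(u^*_{II} \subset u^*)}{2(M/N)\, f/(1 - f)}$$

For a nominal power law graph and tree, the $\mathrm{SNR}(u^*_I \subset v_I) = 11/2^8$. Likewise, the
PD is

$$\mathrm{PD}(u^*_I \subset u^*) = |u^*_I|/|u^*| = |u^* - u^*_{II}|/|u^*| = 1 - \mathrm{PD}(u^*_{II} \subset u^*) = 1 - 5/2^4$$

Similarly, the PFA is

$$\mathrm{PFA}(v_I - u^*_I \subset v_I) \approx 1 - \mathrm{SNR}(u^*_I \subset v_I) = 1 - 11/2^8$$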

8.5.4 First neighbor leaves 

Clearly, the SNR of the first neighbors will be higher than the SNR of the second
neighbors or of the first and second neighbors combined. Likewise, the first
neighbors can also be divided into two classes of vertices with higher and lower SNR. Let
$u^*_I$ be made up of the branch and leaf vertices $u^*_B$ and $u^*_L$ such that

$$u^*_I = u^*_B + u^*_L$$

$u^*_B$ are branch vertices of t that are first nearest neighbors of v. $u^*_L$ are leaf vertices
of t that are also first nearest neighbors of v, so that $u^*_B + u^*_L$ is the set of all the first
neighbors of v in $u^*$.

The number of vertices $|u^*_L|$ can be approximated by adding the number of
leaves and leaflike branches in $u^*$. The probability that a leaf node's branch node is
in v is f, and so the number of leaf nodes in $u^*$ with branches in v is approximately
$(1 - f) f (1 - n_T^{-1})|t|$. A leaflike branch occurs when a branch node has only one of
its neighbors in v, and so appears to look like a leaf node. In other words, branches
in $u^*$ will become leaflike when all but one of their neighbors is in $u^*$, which occurs
with a probability of $(n_T + 1)(1 - f)^{n_T} f$. Thus, the total number of leaflike nodes
in $u^*$ is

$$|u^*_L| \approx (1 - f) f (1 - n_T^{-1})|t| + (n_T + 1) f (1 - f)^{n_T + 1}|t|/n_T = [(1 - f)(1 - n_T^{-1}) + (1 - f)^{n_T + 1}(1 + n_T^{-1})]\, f\, |t|$$

and

$$\frac{|u^*_L|}{|v_I|} = \frac{(1 - f)(1 - n_T^{-1}) + (1 - f)^{n_T + 1}(1 + n_T^{-1})}{2(M/N)}$$

For a nominal power law graph and tree, $|u^*_L|/|v_I| \approx 7/2^8$. The corresponding SNR
is

$$\mathrm{SNR}(u^*_L \subset v_I) = |u^*_L|/|v_I - u^*_L| = (|v_I|/|u^*_L| - 1)^{-1} \approx |u^*_L|/|v_I| = 7/2^8$$

The PD is

$$\mathrm{PD}(u^*_L \subset u^*) = \frac{|u^*_L|}{|u^*|} = \frac{[(1 - f)(1 - n_T^{-1}) + (1 - f)^{n_T + 1}(1 + n_T^{-1})]\, f\, |t|}{(1 - f)|t|} = [(1 - n_T^{-1}) + (1 - f)^{n_T}(1 + n_T^{-1})]\, f = 7/2^4$$

Likewise, the PFA is

$$\mathrm{PFA}(v_I - u^*_L \subset v_I) = |v_I - u^*_L|/|v_I| = 1 - |u^*_L|/|v_I| = 1 - 7/2^8$$

8.5.5 First neighbor branches

Probably the easiest nodes to identify are tree branches that share neighbors in v.
The number of branchlike nodes that are neighbors of v is not large. The number of
background vertices that appear branchlike because two neighbors of v happen to be
the same node by chance is approximately

$$|v_B| \approx |v_I|^2/N = 4(M/N)^2 |v|^2/N = 4(M/N)^2 f^2 |t|^2/N$$

The number of branch nodes in a tree is approximately $|t|/n_T$ and the number of
branch nodes in $u^*$ is $(1 - f)|t|/n_T$. A branch node will appear as a branch if two
or more of its neighbors are in v, which has a probability of
$1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)$. Thus the number of branch nodes that will appear branchlike is

$$|u^*_B| \approx [1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)]\,(1 - f)|t|/n_T$$

The corresponding SNR is

$$\mathrm{SNR}(u^*_B \subset u^*_B + v_B) = \frac{|u^*_B|}{|v_B|} = \frac{[1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)]\,(1 - f)|t|/n_T}{4(M/N)^2 f^2 |t|^2/N} = \frac{[1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)]\,(1 - f)\,(N/|t|)}{4(M/N)^2 f^2 n_T}$$

For a nominal power law graph and tree, $\mathrm{SNR}(u^*_B \subset u^*_B + v_B) = (N/|t|)/2^9 = 2^4$.
The PD is

$$\mathrm{PD}(u^*_B \subset u^*) = \frac{|u^*_B|}{|u^*|} = \frac{[1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)]\,(1 - f)|t|/n_T}{(1 - f)|t|} = [1 - ((1 - f)^{n_T + 1} + (n_T + 1)(1 - f)^{n_T} f)]\,(1/n_T) = 2^{-2}$$

Likewise, the PFA is

$$\mathrm{PFA}(v_B \subset u^*_B + v_B) = |v_B|/|u^*_B + v_B| = [1 + |u^*_B|/|v_B|]^{-1} = [1 + \mathrm{SNR}(u^*_B \subset u^*_B + v_B)]^{-1} \approx 1/\mathrm{SNR}(u^*_B \subset u^*_B + v_B) = 2^{-4}$$



8.5.6 SNR hierarchy 

The analysis of the previous sections indicates that the vertices in $u^*$ can be grouped
into several classes with very different SNRs (see Figure 8.5). These different classes
indicate that certain vertices in $u^*$ should be easy to detect while others will be
quite difficult. The filtering approach described subsequently targets the higher
SNR groups $u^*_L$ and $u^*_B$.
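
The nominal values quoted for each class can be reproduced with exact rational arithmetic. The sketch below is a worked check under stated assumptions rather than part of the original analysis: it takes f = 1/2, a binary tree, M/N = 8, and a ratio N/|t| of 2^13, and the variable names are invented here. The counts `u_ii`, `u_l`, and `u_b` are expressed as fractions of |t|.

```python
from fractions import Fraction

f = Fraction(1, 2)        # fraction of tree vertices that are known
n_t = 2                   # branches per node
m_over_n = 8              # average number of edges per vertex, M/N
n_over_t = 2 ** 13        # assumed N/|t|
inv_n_t = Fraction(1, n_t)

snr_all = (1 - f) / n_over_t                               # select every vertex
snr_first_and_second = ((1 - f) / f) / (4 * m_over_n ** 2)

u_ii = (1 - f) ** 2 * (1 - inv_n_t + inv_n_t * (1 - f) ** n_t)   # fallen leaves, branches
snr_second = u_ii / (4 * m_over_n ** 2 * f)
pd_second = u_ii / (1 - f)

snr_first = (1 - pd_second) / (2 * m_over_n * f / (1 - f))

u_l = ((1 - f) * (1 - inv_n_t) + (1 - f) ** (n_t + 1) * (1 + inv_n_t)) * f
snr_leaves = u_l / (2 * m_over_n * f)
pd_leaves = u_l / (1 - f)

p_branchlike = 1 - ((1 - f) ** (n_t + 1) + (n_t + 1) * (1 - f) ** n_t * f)
u_b = p_branchlike * (1 - f) / n_t
snr_branches = u_b * n_over_t / (4 * m_over_n ** 2 * f ** 2)
pd_branches = u_b / (1 - f)
```

With these inputs the SNR rises from 2^-14 when everything is selected to 2^4 for the branchlike first neighbors, and the PDs of the leaf and branch classes add up to the PD of the first neighbors as a whole.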



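Figure 8.5. SNR hierarchy.

The SNR of different sets of unknown tree vertices $u^*$ as they are selected
from the background. The axis shows the relative signal-to-noise ratio (SNR); the
sets shown are $\mathrm{SNR}(u^* \subset a)$, $\mathrm{SNR}(u^*_{II} \subset v_{II})$, $\mathrm{SNR}(u^*_I \subset v_I)$,
$\mathrm{SNR}(u^*_L \subset v_I)$, and $\mathrm{SNR}(u^*_B \subset u^*_B + v_B)$.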
8.6 Linear filter

This section presents a simple linear filtering approach for finding $u^*$ and follows
the steps shown in the previous section. At each step, the goal is to increase the
SNR until the SNR is sufficiently high that the probability of finding vertices in $u^*$
is good.

The algorithm relies on the fact that the maximum degree of a node in the
tree is $n_T + 1$. This approach is chosen, as opposed to an approach using more
tree-specific attributes, because in real applications a tree is most likely an
approximation to the subgraph of interest. Thus, the algorithm is applicable to the class
of subgraphs that have a maximum vertex degree.

The SNR analysis relied on a linear model of the background and assumed 
that all vertices had on average M/N edges. In reality, the degree distribution of 
the background is a power law. At several stages in the algorithm, adjustments will 
be made to eliminate the occasional high degree vertex. 

The boundary between linear and nonlinear detection can be a little fuzzy. In
this case, linear detection is characterized by not using nonlinear operations (e.g.,
$A^2$) or recursive global optimization techniques to find the tree.

8.6.1 Find nearest neighbors 

To begin, recall $t = v + u^*$. Step 0 is to find all nearest neighbors $u_0$ of v by
computing the adjacency matrices

$$A_{v \times u_0} = I_v A - A_{v^2}, \qquad A_{u_0 \times v} = A I_v - A_{v^2}$$

The degree of the neighboring vertices with edges into v is then

$$d_{u_0 \times v} = (A_{u_0 \times v} + A'_{v \times u_0})\, 1_{N \times 1}$$

$u_0$ can now be found by finding all nonzeros in $d_{u_0 \times v}$

$$u_0 = (d_{u_0 \times v} > 0)$$

This step selects the first nearest neighbors of V (see Figure 8.6).

8.6.2 Eliminate high degree nodes 

One property of a power law graph is the existence of very high degree nodes that
cannot be accounted for in linear theory. Likewise, a property of trees is that the
maximum vertex degree is small and can be approximated by $d^{\max}_{t^2} \approx d^{\max}_{v^2}$. Step
1 is used to exploit this constraint. In general, step 1 is highly nonlinear in that it
mostly eliminates a very few candidate vertices, but occasionally eliminates a large
number.

Step 1 eliminates from $u_0$ those vertices whose number of edges with v is
greater than $d^{\max}_{v^2}$

$$u_1 = (d_{u_0 \times v} \le d^{\max}_{v^2}) \mathbin{.*} u_0$$

This step eliminates some very high degree vertices that are chance neighbors of v
(see gray vertical bar at far right of Figure 8.7).
Figure 8.6. Tree filter step 0.

$U_0$ is the set of vertices that are neighbors of V. $U_0$ includes $U^*_B \cup U^*_L$
(first neighbors) but not $U^*_{II}$ (second neighbors).

Figure 8.7. Tree filter steps 1a and 1b.

V are vertices with available edges (i.e., $d_{v^2} < d^{\max}_{v^2}$). $U_1$ are vertices
with available edges (i.e., $d_{u_0 \times v} \le d^{\max}_{v^2}$). Gray vertices are eliminated
by applying these constraints. $U_2$ are those vertices that remain.



8.6.3 Eliminate occupied nodes 

Step 2 eliminates from v those vertices that are completely occupied

$$v = v - (d_{v^2} \ge d^{\max}_{v^2})$$

Adding a neighbor to any such node would cause the degree of the node to exceed
$d^{\max}_{v^2}$. $u_2$ is then computed by restricting the known tree vertices to v

$$u_2 = (d_{u_1 \times v} > 0)$$

This step will eliminate the occasional interior tree vertex in v, all of whose
neighbors are already in v.
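
Steps 0 through 2 can be sketched together on an edge-set representation. This is an illustrative reading of the filter, not the author's implementation: the sets stand in for the boolean vectors, the maximum degree `d_max` is passed in by the caller, and the function name `filter_steps_0_to_2` and the toy graph are hypothetical.

```python
def filter_steps_0_to_2(edges, v, n, d_max):
    """Return (u0, u1, v_open, u2) for known vertices v and degree cap d_max."""
    d_v2 = [0] * n     # degree of each known vertex inside v
    d_u0 = [0] * n     # number of edges between each outside vertex and v
    for i, j in edges:
        if i in v and j in v:
            d_v2[i] += 1
            d_v2[j] += 1
        elif j in v:               # i is outside v
            d_u0[i] += 1
        elif i in v:               # j is outside v
            d_u0[j] += 1
    u0 = {x for x in range(n) if d_u0[x] > 0}           # step 0: neighbors of v
    u1 = {x for x in u0 if d_u0[x] <= d_max}            # step 1: drop high degree
    v_open = {x for x in v if d_v2[x] < d_max}          # step 2: drop occupied v
    u2 = set()
    for i, j in edges:                                  # neighbors of open v only
        if i in v_open and j in u1:
            u2.add(j)
        if j in v_open and i in u1:
            u2.add(i)
    return u0, u1, v_open, u2

edges = {(0, 1), (0, 2), (1, 3),                 # edges inside the known set
         (2, 4), (5, 0),                         # ordinary candidates
         (6, 0), (6, 1), (6, 2), (3, 6)}         # a high degree chance neighbor
u0, u1, v_open, u2 = filter_steps_0_to_2(edges, v={0, 1, 2, 3}, n=7, d_max=2)
```

In this toy case vertex 6 is removed in step 1 because it touches v four times, and vertex 5 is removed in step 2 because its only contact in v is already fully occupied.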

8.6.4 Find high probability nodes 

Step 3 computes the probability that a given neighbor is a part of the tree by
computing how many neighbor slots each vertex in V has available ($d^{\max}_{v^2} - d_{v^2}$).
The number of available slots is then divided by the number of vertices in $U_2$ that
could potentially fill those slots ($d_{v \times u_2}$)

$$w_{v \times u_2} = \max((d^{\max}_{v^2} - d_{v^2}) \mathbin{.*} v,\ 0) \mathbin{.\&/} d_{v \times u_2}$$

Note: max and .&/ are used to deal with out-of-bounds numerators (i.e., negative)
and denominators (i.e., equal to zero). Sorting the above weights can be used to
make an ordered ranking of each vertex in $U_2$

$$w^{\mathrm{sort}}_{v \times u_2} = \mathrm{sort}(w_{v \times u_2})$$

Determining how many vertices should be selected can be done by estimating
the number of edges V should have that are in T ($N_V d^{\max}_{v^2}/2$) and subtracting
the edges that are actually observed, $2\,\mathrm{nnz}(A_{v^2})$. Further restricting the number of
vertices to the highest 2/3 eliminates the lowest 1/3 of the probability distribution
of weights (that are likely to be erroneous). The result is an estimate for the number
of vertices in $U_3$

$$N_{U_3} = \lceil (2/3)\,[N_V d^{\max}_{v^2}/2 - 2\,\mathrm{nnz}(A_{v^2})] \rceil$$

Selecting the top $N_{U_3}$ of the sorted weights gives a threshold weight that can be
used to select $U_3$

$$u_3 = (w_{v \times u_2} > w^{\mathrm{sort}}_{v \times u_2}(N - N_{U_3}))$$

8.6.5 Find high degree nodes 

$U_3$ should contain many vertices in $U^*$, but it will also contain many vertices that
are not. The probability of a random vertex having edges to two or more vertices
in T is small. Therefore, selecting the vertices in $U_3$ that are connected to more
than one vertex reduces the set to vertices that are nearly all in $U^*$.
Let $T_3$ be the current estimate of vertices in the tree

$$t_3 = v + u_3$$

Then, $U_4$ is the set of vertices where $U_3$ has more than one edge in $T_3$

$$u_4 = (d_{t_3^2} \mathbin{.*} u_3 > 1)$$
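
This last selection can be sketched as follows, again with sets in place of boolean vectors. The function name `select_multiply_connected` is made up, and skipping self-loops is a choice of this sketch rather than something stated in the text.

```python
def select_multiply_connected(edges, v, u3):
    """Keep the vertices of u3 that have more than one edge inside t3 = v + u3."""
    t3 = v | u3
    degree = {x: 0 for x in u3}
    for i, j in edges:
        if i == j:                      # assumption: ignore self-loops
            continue
        if i in t3 and j in t3:
            if i in u3:
                degree[i] += 1
            if j in u3:
                degree[j] += 1
    return {x for x, d in degree.items() if d > 1}

edges = {(0, 1), (1, 2), (2, 3), (0, 4)}
u4 = select_multiply_connected(edges, v={0, 1}, u3={2, 3, 4, 5})
```

Here only vertex 2 survives, because it is the only candidate with two edges into the current tree estimate.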

Finally, it is worth noting that, given $U^*$, the number of correctly selected
vertices can be computed at any stage $U_i$ in the above process by counting the
nonzero elements of $u_i \mathbin{.*} u^*$.
Figure 8.8. PD versus PFA. 

Results from 10 Monte Carlo simulations of the tree detection algorithm 
are shown with crosses and the theoretical estimates are shown with 
circles. 



8.7 Results and conclusions 

The above tree-finding filter was applied to Monte Carlo simulations of a power law
graph ($N = 2^{20}$, $M \approx 8N$) and a small tree ($|t| = 2^7$), where half the vertices in
the tree were given in V (i.e., f = 1/2). At each stage in the algorithm, the number
of correct detections is calculated along with the estimated SNR. The PD and PFA 
are shown in Figure 8.8 along with the theoretical PD and PFA calculations. The 
results indicate that the algorithm is achieving close to what is theoretically possible. 
What is particularly interesting is that certain classes of vertices in the tree can be 
detected with virtually zero false alarms. 

In conclusion, detecting subgraphs of interest in larger graphs is the goal of 
many graph analysis techniques. The basis of detection theory is computing the 
probability of a "foreground" with respect to a model of the "background" data. 
Combining these models allows estimates of the SNR, PD, and PFA for different 
classes of vertices in the foreground. These estimates can then be used to construct 
filters for computing the probability that a background graph contains a particular 
foreground graph. This approach is successfully applied to the problem of detecting 
a partially labeled tree graph in a power law background graph. 
Kronecker Graphs 
Abstract 

How can we generate realistic networks? In addition, how can we do so 
with a mathematically tractable model that allows for rigorous analysis 
of network properties? Real networks exhibit a long list of surprising 
properties: heavy tails for the in- and out-degree distribution, heavy tails 
for the eigenvalues and eigenvectors, small diameters, and densification 
and shrinking diameters over time. We propose a generative model 
for networks that is mathematically tractable and produces the above- 
mentioned structural properties. Our model uses a standard matrix 
operation, the Kronecker product, to generate graphs, which we refer 
to as "Kronecker graphs," that are proved to naturally obey common 
network properties. Empirical evidence shows that Kronecker graphs 
can effectively model the structure of real networks. KronFit is a fast 
and scalable algorithm for fitting the Kronecker graph generation model 
to large real networks. KronFit takes linear time, by exploiting the 
structure of Kronecker matrix multiplication and by using statistical 
simulation techniques. Experiments on a range of networks show that 
KronFit finds parameters that mimic the properties of real networks. 
In fact, typically four parameters can accurately model several aspects of 
global network structure. Once fitted, the model parameters can be used 
to gain insights about the network structure, and the resulting synthetic 
graphs can be used for null-models, anonymization, extrapolations, and 
graph summarization. 

*Computer Science Department, Stanford University, Stanford, CA 94305-9040
(jure@cs.stanford.edu).

Based on joint work with Deepayan Chakrabarti, Christos Faloutsos, Zoubin Ghahramani, 
and Jon Kleinberg. 




Chapter 9. Kronecker Graphs 



9.1 Introduction 

What do real graphs look like? How do they evolve over time? How can we 
generate synthetic, but realistic-looking, time-evolving graphs? Recently, network 
analysis has been attracting much interest, with an emphasis on finding patterns 
and abnormalities in social networks, computer networks, e-mail interactions, gene 
regulatory networks, and many more. Most of the work focuses on static snapshots 
of graphs, where fascinating "laws" have been discovered, including small diameters 
and heavy-tailed degree distributions. 

In parallel with discoveries of such structural "laws," there has been work to 
find models of network formation that generate these structures. A good realistic 
network generation model is important for at least two reasons. The first is that 
such a model can generate graphs for extrapolations, hypothesis testing, "what-if" 
scenarios, and simulations, when real graphs are difficult or impossible to collect. 
For example, how well will a given protocol run on the Internet five years from now? 
Accurate network models can produce more realistic models for the future Internet, 
on which simulations can be run. The second reason is that these models require 
us to think about the network properties that generative models should obey to be 
realistic. 

In this chapter, we introduce Kronecker graphs, a generative network model 
that obeys all the main static network patterns that have appeared in the literature; 
see, for instance, [Faloutsos et al. 1999, Albert et al. 1999, Chakrabarti et al. 2004,
Farkas et al. 2001, Mihail & Papadimitriou 2002, Watts & Strogatz 1998]. Our model
also obeys recently discovered temporal evolution patterns [Leskovec et al. 2005b,
Leskovec et al. 2007a]. In contrast to other models that match this combination of
network properties (for example, [Bu & Towsley 2002, Klemm & Eguiluz 2002,
Vazquez 2003, Leskovec et al. 2005b, Zheleva et al. 2009]), Kronecker graphs also
lead to tractable analysis and rigorous proofs. Furthermore, the Kronecker graph
generative process also has a nice natural interpretation and justification. 

Our model is based on a matrix operation, the Kronecker product. There 
are several known theorems on Kronecker products. They correspond exactly to 
a significant portion of what we want to prove: heavy-tailed distributions for in- 
degree, out-degree, eigenvalues, and eigenvectors. We also demonstrate how Kro- 
necker graphs can match the behavior of several real networks (social networks, 
citations, web, Internet, and others). While Kronecker products have been studied
by the algebraic combinatorics community (see, e.g., [Chow 1997, Imrich 1998,
Imrich & Klavžar 2000, Hammack 2009]), the present work is the first to employ
this operation in the design of network models to match real data.

We then go a step further and tackle the following problem: given a 
large real network, we want to generate a synthetic graph, so that the resulting 
synthetic graph matches the properties of the real network as well as possible. 

Ideally we would like (a) a graph generation model that naturally produces
networks in which many properties that are also found in real networks
emerge; (b) the model parameter estimation to be fast and scalable, so that we
can handle networks with millions of nodes; and (c) the resulting set of parameters
to generate realistic-looking networks that match the statistical properties of the
target, real networks.

In general, the problem of modeling network structure presents several concep- 
tual and engineering challenges. Which generative model should we choose, among 
the many in the literature? How do we measure the goodness of the fit? (Least
squares do not work well for power laws.) If we use likelihood, how do we estimate it
faster than $O(N^2)$ complexity? How do we solve the node correspondence problem;
i.e., which node of the real network corresponds to what node of the synthetic one?

To answer the above questions, we present KronFit, a fast and scalable al- 
gorithm for fitting Kronecker graphs by using the maximum-likelihood principle. 
When one calculates the likelihood, there are two challenges. First, one needs to 
solve the node correspondence problem by matching the nodes of the real and the 
synthetic network. Essentially, one has to consider all mappings of nodes of the 
network to the rows and columns of the graph adjacency matrix. This becomes 
intractable for graphs with more than tens of nodes. Even when given the "true" 
node correspondences, just evaluating the likelihood is still prohibitively expensive 
for large graphs. We present solutions to both of these problems. We develop a 
Metropolis sampling algorithm for sampling node correspondences, and we approx- 
imate the likelihood to obtain a linear time algorithm for Kronecker graph model 
parameter estimation that scales to large networks with millions of nodes and edges. 
KronFit gives orders of magnitude speedups against older methods. 

Our extensive experiments on synthetic and real networks show that Kronecker 
graphs can efficiently model statistical properties of networks, such as degree dis- 
tribution and diameter, while using only four parameters. 

Once the model is fitted to the real network, there are several benefits and 
applications: 

(a) Network structure: The parameters give us insight into the global structure 
of the network itself. 

(b) Null-model: When working with network data, we would often like to assess 
the significance or the extent to which a certain network property is expressed. 
We can use a Kronecker graph as an accurate null-model. 

(c) Simulations: Given an algorithm working on a graph, we would like to evaluate 
how its performance depends on various properties of the network. Using our 
model, one can generate graphs that exhibit various combinations of such 
properties and then evaluate the algorithm. 

(d) Extrapolations: We can use the model to generate a larger graph to help us 
understand how the network will look in the future. 

(e) Sampling: Conversely, we can also generate a smaller graph, which may be 
useful for running simulation experiments (e.g., simulating routing algorithms 
in computer networks, or virus/worm propagation algorithms) when these 
algorithms may be too slow to run on large graphs. 

(f) Graph similarity: To compare the similarity of the structure of different 
networks (even of different sizes), one can use the differences in estimated 
parameters as a similarity measure. 

(g) Graph visualization and compression: We can compress the graph, by storing 
just the model parameters, and the deviations between the real and the syn- 
thetic graph. Similarly, for visualization purposes, one can use the structure 
of the parameter matrix to visualize the backbone of the network, and then 
display the edges that deviate from the backbone structure. 

(h) Anonymization: Suppose that the real graph cannot be publicized, e.g., cor- 
porate e-mail network or customer-product sales in a recommendation system. 
Yet, we would like to share our network. Our work gives ways to simulate 
such a realistic, "similar" network. 

The present chapter builds on our previous work on Kronecker graphs (see 
[Leskovec et al. 2005a, Leskovec & Faloutsos 2007]) and is organized as follows. Sec- 
tion 9.2 briefly surveys the related literature. In Section 9.3, we introduce the Kro- 
necker graph model and give formal statements about the properties of networks it 
generates. We investigate the model using simulations in Section 9.4 and continue 
by introducing KronFit, the Kronecker graphs parameter estimation algorithm, in 
Section 9.5. We present experimental results on a wide range of real and synthetic 
networks in Section 9.6. We close with discussion and conclusions in Sections 9.7 
and 9.8. 

9.2 Relation to previous work on network modeling 

Networks across a wide range of domains present surprising regularities, such as 
power laws, small diameters, communities, and so on. We use these patterns as 
sanity checks; that is, our synthetic graphs should match those properties of the 
real target graph. 

Most of the related work in this field has concentrated on two aspects: prop- 
erties and patterns found in real-world networks, and then ways to find models to 
build understanding about the emergence of these properties. First, we discuss 
the commonly found patterns in (static and temporally evolving) graphs, and 
then the state of the art in graph generation methods. 

9.2.1 Graph patterns 

Here we briefly introduce the network patterns (also referred to as properties or 
statistics) that we will later use to compare the similarity between the real networks 
and their synthetic counterparts produced by the Kronecker graphs model. While 
many patterns have been discovered, two of the principal ones are heavy-tailed 
degree distributions and small diameters. 

Degree distribution: The degree distribution of a graph is a power law if 
the number of nodes $N_d$ with degree $d$ is given by $N_d \propto d^{-\gamma}$ ($\gamma > 0$), where 
$\gamma$ is called the power law exponent. Power laws have been found on the 
Internet [Faloutsos et al. 1999], the web [Kleinberg et al. 1999, Broder et al. 2000], 
citation graphs [Redner 1998], online social networks [Chakrabarti et al. 2004], and 
many others. 

Small diameter: Most real-world graphs exhibit relatively small diameter (the 
"small-world" phenomenon, or "six degrees of separation" [Milgram 1967]). A 
graph has diameter $D$ if every pair of nodes can be connected by a path of length 
at most $D$ edges. The diameter $D$ is susceptible to outliers. Thus, a more robust 
measure of the pairwise distances between nodes in a graph is the integer effective 
diameter [Tauro et al. 2001], which is the minimum number of links (steps/hops) 
in which some fraction (or quantile $q$, say $q = 0.9$) of all connected pairs of nodes 
can reach each other. Here we make use of effective diameter, which we define 
as follows [Leskovec et al. 2005b]. For each natural number $h$, let $g(h)$ denote the 
fraction of connected node pairs whose shortest connecting path has length at most 
$h$, i.e., at most $h$ hops away. We then consider a function defined over all positive 
real numbers $x$ by linearly interpolating between the points $(h, g(h))$ and 
$(h + 1, g(h + 1))$ for each $x$, where $h = \lfloor x \rfloor$, and we define the effective diameter of the 
network to be the value $x$ at which the function $g(x)$ achieves the value 0.9. The 
effective diameter has been found to be small for large real-world graphs, such as the 
Internet, web, and online social networks [Albert & Barabasi 2002, Milgram 1967, 
Leskovec et al. 2005b]. 
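
The interpolated effective diameter defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is given as an adjacency-list dictionary, pairs are counted as ordered pairs of distinct nodes, the graph has at least one edge, and the function names `hop_fractions` and `effective_diameter` are hypothetical helpers, not part of any library.

```python
from collections import deque


def hop_fractions(adj):
    """Return a list g where g[h] is the fraction of connected ordered
    node pairs (u != v) whose shortest path has length at most h."""
    counts = {}  # distance -> number of ordered pairs at exactly that distance
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if v != source:
                counts[d] = counts.get(d, 0) + 1
    total = sum(counts.values())
    g, running = [0.0], 0
    for h in range(1, max(counts) + 1):
        running += counts.get(h, 0)
        g.append(running / total)
    return g


def effective_diameter(adj, q=0.9):
    """Value x at which the piecewise-linear interpolation of g reaches q."""
    g = hop_fractions(adj)
    for h in range(len(g) - 1):
        if g[h + 1] >= q:
            # interpolate between (h, g[h]) and (h + 1, g[h + 1])
            return h + (q - g[h]) / (g[h + 1] - g[h])
    return float(len(g) - 1)


# Illustrative input: an undirected path on five nodes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(effective_diameter(path, q=0.9))
```

Running one breadth-first search per node is quadratic in the worst case, so for large graphs one would sample source nodes instead; that variation is not shown here.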

Hop plot: It extends the notion of diameter by plotting the number of reachable 
pairs $g(h)$ within $h$ hops, as a function of the number of hops $h$ [Palmer et al. 2002]. 
It gives us a sense of how quickly nodes' neighborhoods expand with the number of 
hops. 

Scree plot: This is a plot of the eigenvalues (or singular values) of the graph 
adjacency matrix, versus their rank, using the logarithmic scale. The scree plot 
is also often found to approximately obey a power law [Chakrabarti et al. 2004, 
Farkas et al. 2001]. Moreover, this pattern was also found analytically for random 
power law graphs [Chung et al. 2003, Mihail & Papadimitriou 2002]. 

Network values: The distribution of eigenvector components (indicators of 
"network value") associated with the largest eigenvalue of the graph adjacency 
matrix has also been found to be skewed [Chakrabarti et al. 2004]. 

Node triangle participation: Edges in real-world networks and especially in 
social networks tend to cluster [Watts & Strogatz 1998] and form triads of 
connected nodes. Node triangle participation is a measure of transitivity in networks. 
It counts the number of triangles a node participates in, i.e., the number of 
connections between the neighbors of a node. The plot of the number of triangles $\Delta$ 
versus the number of nodes that participate in $\Delta$ triangles has also been found to 
be skewed [Tsourakakis 2008]. 

Densification power law: The relation between the number of edges $M(t)$ and 
the number of nodes $N(t)$ in an evolving network at time $t$ obeys the densification 
power law (DPL), which states that $M(t) \propto N(t)^a$. The densification exponent $a$ is 
typically greater than 1, implying that the average degree of a node in the network 
is increasing over time (as the network gains more nodes and edges). Densification 
implies that real networks tend to sprout many more edges than nodes, and thus 
densify as they grow [Leskovec et al. 2005b, Leskovec et al. 2007a]. 

Shrinking diameter: The effective diameter of graphs tends to shrink or sta- 
bilize as the number of nodes in a network grows over time (see, for instance, 
[Leskovec et al. 2005b, Leskovec et al. 2007a]). Diameter shrinkage is somewhat 
counterintuitive since, from common experience, one would expect that as the vol- 
ume of the object (a graph) grows, the size (i.e., the diameter) would also grow. 
But for real networks, this does not hold as the diameter shrinks and then seems to 
stabilize as the network grows. 

9.2.2 Generative models of network structure 

The earliest probabilistic generative model for graphs was the Erdos-Renyi random 
graph model [Erdos & Renyi 1960], where each pair of nodes has an identical, 
independent probability of being joined by an edge. The study of this model has led 
to a rich mathematical theory. However, as the model was not developed to model 
real-world networks, it produces graphs that fail to match real networks in a number 
of respects (for example, it does not produce heavy-tailed degree distributions). 

The vast majority of recent network models involve some form of preferential 
attachment (see, for instance, [Barabasi & Albert 1999, Albert & Barabasi 2002, 
Winick & Jamin 2002, Kleinberg et al. 1999, Kumar et al. 1999, Flaxman et al. 2007]) 
that employs a simple rule: a new node joins the graph at each time step, and then 
creates a connection to an existing node u with a probability proportional to the 
degree of the node u. This rule creates a "rich get richer" phenomenon and power 
law tails in degree distribution. Typically, the diameter in this model grows slowly 
with the number of nodes N, violating the "shrinking diameter" property mentioned 
above. 

There are many variations of the preferential attachment model, all somehow 
employing the "rich get richer" type mechanism, e.g., the "copying model" (see 
[Kumar et al. 2000]), the "winner does not take all" model [Pennock et al. 2002], 
the "forest fire" model [Leskovec et al. 2005b], the "random surfer model" (see 
[Blum et al. 2006]), etc. 

A different family of network methods strives for small diameter and local 
clustering in networks. Examples of such models include the Waxman genera- 
tor [Waxman 1988] and the small-world model [Watts & Strogatz 1998]. Another 
family of models shows that heavy tails emerge if nodes try to optimize their 
connectivity under resource constraints; see [Carlson & Doyle 1999, Fabrikant et al. 2002]. 

In summary, most current models focus on modeling only one (static) network 
property. In addition, it is usually hard to analytically deduce properties of the 
network model. 

9.2.3 Parameter estimation of network models 

Until recently, relatively little effort was made to fit the above network models to 
real data. One of the difficulties is that most of the above models usually define a 
mechanism or a principle by which a network is constructed, and thus parameter 
estimation is either trivial or almost impossible. 
Most work in estimating network models comes from the areas of social sciences, 
statistics, and social network analysis, in which exponential random graphs, 
also known as the p* model, were introduced [Wasserman & Pattison 1996]. The model 
essentially defines a log-linear model over all possible graphs $G$, $p(G \mid \theta) \propto \exp(\theta^{T} s(G))$, 
where $G$ is a graph, and $s$ is a set of functions that can be viewed as summary 
statistics for the structural features of the network. The p* model usually focuses on 
"local" structural features of networks (e.g., characteristics of nodes that determine 
the presence of an edge, link reciprocity, etc.). While exponential random graphs have 
been very useful for modeling small networks, and individual nodes and edges, our 
goal here is different in the sense that we aim to accurately model the structure of 
the network as a whole. Moreover, we aim to model and estimate parameters of 
networks with millions of nodes while, even for graphs of small size (> 100 nodes), 
the number of model parameters in exponential random graphs usually becomes 
too large, and estimation is prohibitively expensive, both in terms of computational 
time and memory. 

Regardless of a particular choice of a network model, a common theme when 
estimating the likelihood $P(G)$ of a graph $G$ under some model is the challenge of 
finding the correspondence between the nodes of the true network and its synthetic 
counterpart. The node correspondence problem results in the factorially many 
possible matchings of nodes. One can think of the correspondence problem as 
a test of graph isomorphism. Two isomorphic graphs G and G' with differently 
assigned node IDs should have the same likelihood $P(G) = P(G')$, so we aim to 
find an accurate mapping between the nodes of the two graphs. 

An ordering or a permutation defines the mapping of nodes in one network 
to nodes in the other network. For example, Butts [Butts 2005] used permutation 
sampling to determine similarity between two graph adjacency matrices, while 
Bezakova, Kalai, and Santhanam [Bezakova et al. 2006] used permutations for graph 
model selection. Recently, an approach for estimating parameters of the "copying" 
model was introduced (see [Wiuf et al. 2006]); however, the authors also noted that the 
class of "copying" models may not be rich enough to accurately model real 
networks. As we show later, the Kronecker graph model seems to have the necessary 
expressive power to mimic real networks well. 

9.3 Kronecker graph model 

The Kronecker graph model we propose here is based on a recursive construction. 
Defining the recursion properly is somewhat subtle, as a number of standard, 
related graph construction methods fail to produce graphs that densify according to 
the patterns observed in real networks, and they also produce graphs whose 
diameters increase. To produce densifying graphs with constant/shrinking diameter, and 
thereby match the qualitative behavior of a real network, we develop a procedure 
that is best described in terms of the Kronecker product of matrices. 

9.3.1 Main idea 

The main intuition behind the model (see Table 9.1) is to create self-similar graphs, 
recursively. We begin with an initiator graph $K_1$, with $N_1$ nodes and $M_1$ edges, 
and by recursion we produce successively larger graphs $K_2, K_3, \ldots$ such that the 
$k$th graph $K_k$ is on $N_k = N_1^k$ nodes. If we want these graphs to exhibit a version 
of the densification power law [Leskovec et al. 2005b], then $K_k$ should have $M_k = M_1^k$ 
edges. This is a property that requires some care in order to get right, as 
standard recursive constructions (for example, the traditional Cartesian product or 
the construction of Barabasi [Barabasi et al. 2001]) do not yield graphs satisfying 
the densification power law. 

Table 9.1. Table of symbols. 

Symbol                                              Description
$G$                                                 Real network
$N$                                                 Number of nodes in $G$
$M$                                                 Number of edges in $G$
$K$                                                 Kronecker graph (synthetic estimate of $G$)
$K_1$                                               Initiator of a Kronecker graph
$N_1$                                               Number of nodes in initiator $K_1$
$M_1$                                               Number of edges in $K_1$ (the expected number of edges in $\mathcal{P}_1$, $M_1 = \sum_{ij} \theta_{ij}$)
$G \otimes H$                                       Kronecker product of adjacency matrices of graphs $G$ and $H$
$K_1^{[k]} = K_k = K$                               $k$th Kronecker power of $K_1$
$K_1[i,j]$                                          Entry at row $i$ and column $j$ of $K_1$
$\Theta = \mathcal{P}_1$                            Stochastic Kronecker initiator
$\mathcal{P}_1^{[k]} = \mathcal{P}_k = \mathcal{P}$ $k$th Kronecker power of $\mathcal{P}_1$
$\theta_{ij} = \mathcal{P}_1[i,j]$                  Entry at row $i$ and column $j$ of $\mathcal{P}_1$
$p_{ij} = \mathcal{P}_k[i,j]$                       Probability of an edge in $\mathcal{P}_k$, i.e., entry at row $i$ and column $j$ of $\mathcal{P}_k$
$K = R(\mathcal{P})$                                Realization of a stochastic Kronecker graph $\mathcal{P}$
$l(\Theta)$                                         Log-likelihood. Log-prob. that $\Theta$ generated real graph $G$, $\log P(G \mid \Theta)$
$\hat{\Theta}$                                      Parameters at maximum likelihood, $\hat{\Theta} = \arg\max_{\Theta} P(G \mid \Theta)$
$\sigma$                                            Permutation that maps node IDs of $G$ to those of $\mathcal{P}$
$a$                                                 Densification power law exponent, $M(t) \propto N(t)^a$
$D$                                                 Diameter of a graph
$N_c$                                               Number of nodes in the largest weakly connected component of a graph
$\omega$                                            Proportion of times SwapNodes permutation proposal distribution is used

It turns out that the Kronecker product of two matrices is the right tool for 
this goal. The Kronecker product is defined as follows. 

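Definition 9.1 (Kronecker product of matrices). Given two matrices $A = [a_{i,j}]$ 
and $B$ of sizes $n \times m$ and $n' \times m'$, respectively, the Kronecker product matrix $C$ of 
dimensions $(n \cdot n') \times (m \cdot m')$ is given by 

$$
C = A \otimes B \doteq
\begin{pmatrix}
a_{1,1}B & a_{1,2}B & \cdots & a_{1,m}B \\
a_{2,1}B & a_{2,2}B & \cdots & a_{2,m}B \\
\vdots   & \vdots   & \ddots & \vdots   \\
a_{n,1}B & a_{n,2}B & \cdots & a_{n,m}B
\end{pmatrix}.
\tag{9.1}
$$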



We then define the Kronecker product of two graphs simply as the Kronecker 
product of their corresponding adjacency matrices. 
Definition 9.2 (Kronecker product of graphs [Weichsel 1962]). If $G$ and 
$H$ are graphs with adjacency matrices $A(G)$ and $A(H)$, respectively, then the 
Kronecker product $G \otimes H$ is defined as the graph with adjacency matrix $A(G) \otimes A(H)$. 

Observation 1 (Edges in Kronecker-multiplied graphs). 

Edge $(X_{ij}, X_{kl}) \in G \otimes H$ iff $(X_i, X_k) \in G$ and $(X_j, X_l) \in H$, 

where $X_{ij}$ and $X_{kl}$ are nodes in $G \otimes H$, and $X_i$, $X_j$, $X_k$, and $X_l$ are the 
corresponding nodes in $G$ and $H$, as in Figure 9.1. 

The last observation is crucial and deserves elaboration. Basically, each node 
in $G \otimes H$ can be represented as an ordered pair $X_{ij}$, with $i$ a node of $G$ and $j$ 
a node of $H$, and with an edge joining $X_{ij}$ and $X_{kl}$ precisely when $(X_i, X_k)$ is 
an edge of $G$ and $(X_j, X_l)$ is an edge of $H$. This is a direct consequence of the 
hierarchical nature of the Kronecker product. Figure 9.1(a)-(c) further illustrates 
this by showing the recursive construction of $G \otimes H$, when $G = H$ is a 3-node chain. 
Consider node $X_{1,2}$ in Figure 9.1(c): it belongs to the $H$ graph that replaced node 
$X_1$ (see Figure 9.1(b)) and, in fact, is the $X_2$ node (i.e., the center) within this 
small $H$ graph. 
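
Observation 1 can be checked mechanically on a small example. The sketch below is an illustration only: matrices are plain Python lists of rows, `kron` and `check_observation_1` are hypothetical helper names, and the initiator is assumed to be the 3-node chain with self-loops whose adjacency matrix appears in Figure 9.1(d). Node $X_{ij}$ of the product is stored at index `i * n_h + j` (zero-based).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ik * b_jl for a_ik in row_a for b_jl in row_b]
        for row_a in a
        for row_b in b
    ]


def check_observation_1(g, h):
    """Return True if (X_ij, X_kl) is an edge of G (x) H exactly when
    (X_i, X_k) is an edge of G and (X_j, X_l) is an edge of H."""
    n_g, n_h = len(g), len(h)
    product = kron(g, h)
    for i in range(n_g):
        for j in range(n_h):
            for k in range(n_g):
                for l in range(n_h):
                    in_product = product[i * n_h + j][k * n_h + l] == 1
                    in_factors = g[i][k] == 1 and h[j][l] == 1
                    if in_product != in_factors:
                        return False
    return True


# Assumed initiator: 3-node chain with a self-loop on every node.
CHAIN = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]

print(check_observation_1(CHAIN, CHAIN))
```

The nested comprehension in `kron` iterates rows of `a` in the outer loop and rows of `b` in the inner loop, which is what produces the block layout of equation (9.1).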

We propose to produce a growing sequence of matrices by iterating the 
Kronecker product. 

Definition 9.3 (Kronecker power). The $k$th power of $K_1$ is defined as the 
matrix $K_1^{[k]}$ (abbreviated to $K_k$), such that 

$$
K_1^{[k]} = K_k = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{k \text{ times}} = K_{k-1} \otimes K_1 .
$$

Definition 9.4 (Kronecker graph). A Kronecker graph of order $k$ is defined by the 
adjacency matrix $K_1^{[k]}$, where $K_1$ is the Kronecker initiator adjacency matrix. 
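
As a rough illustration of Definitions 9.3 and 9.4, the following sketch builds Kronecker powers by repeated multiplication and reports how the number of nodes and nonzero entries grows. The helper names (`kron`, `kron_power`, `count_nonzero`) are hypothetical, the initiator is assumed to be the 3-node chain with self-loops from Figure 9.1(d), and every nonzero entry (self-loops included) is counted as one edge.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(k1, k):
    """k-th Kronecker power, using K_k = K_{k-1} (x) K_1."""
    result = k1
    for _ in range(k - 1):
        result = kron(result, k1)
    return result


def count_nonzero(m):
    """Number of nonzero entries, i.e., directed edges including self-loops."""
    return sum(1 for row in m for x in row if x)


# Assumed initiator: 3-node chain with a self-loop on every node.
CHAIN = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]

for k in range(1, 5):
    k_k = kron_power(CHAIN, k)
    print(k, len(k_k), count_nonzero(k_k))
```

The dense list-of-lists representation is only practical for small $k$; a sparse edge-list representation would be the natural choice beyond that.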

The self-similar nature of the Kronecker graph product is clear: to produce 
$K_k$ from $K_{k-1}$, we "expand" (replace) each node of $K_{k-1}$ by converting it into a 
copy of $K_1$, and we join these copies together according to the adjacencies in $K_{k-1}$ 
(see Figures 9.1, 9.2, and 9.3). This process is very natural: one can imagine it 
as positing that communities within the graph grow recursively, with nodes in the 
community recursively getting expanded into miniature copies of the community. 
Nodes in the subcommunity then link among themselves and also to nodes from 
other communities. 

Note that there are many different names to refer to the Kronecker product 
of graphs. Other names for the Kronecker product are tensor product, categorical 
product, direct product, cardinal product, relational product, conjunction, weak 
direct product or just product, and even Cartesian product [Imrich & Klavzar 2000]. 

[Figure 9.1 panels: (a) Graph $K_1$; (b) Intermediate stage; (c) Graph $K_2 = K_1 \otimes K_1$; 
(d) Adjacency matrix of $K_1$; (e) Adjacency matrix of $K_2 = K_1 \otimes K_1$.] 

Adjacency matrix of $K_1$ shown in panel (d): 

    1 1 0
    1 1 1
    0 1 1

Figure 9.1. Example of Kronecker multiplication. 

Top: a "3-chain" initiator graph and its Kronecker product with itself. 
Each of the $X_i$ nodes gets expanded into 3 nodes, which are then 
linked using Observation 1. Bottom row: the corresponding adjacency 
matrices. See Figure 9.2 for adjacency matrices of $K_3$ and $K_4$. 



[Figure 9.2 panels: (a) $K_3$ adjacency matrix (27 x 27); (b) $K_4$ adjacency matrix (81 x 81).] 

Figure 9.2. Adjacency matrices of $K_3$ and $K_4$. 

The third and fourth Kronecker power of the $K_1$ matrix as defined in 
Figure 9.1. Dots represent nonzero matrix entries, and white space 
represents zeros. Notice the recursive self-similar structure of the adjacency 
matrix. 

[Figure 9.3 panels, per example: Initiator $K_1$; $K_1$ adjacency matrix; $K_3$ adjacency matrix.] 

$K_1$ adjacency matrices of the two examples: 

    1 1 1 1        1 1 1 1
    1 1 0 0        1 1 0 0
    1 0 1 0        1 0 1 1
    1 0 0 1        1 0 1 1

Figure 9.3. Self-similar adjacency matrices. 

Two examples of Kronecker initiators on 4 nodes and the self-similar 
adjacency matrices they produce. 



9.3.2 Analysis of Kronecker graphs 

We shall now discuss the properties of Kronecker graphs, specifically, their degree 
distributions, diameters, eigenvalues, eigenvectors, and time evolution. The ability 
to prove analytical results about all of these properties is a major advantage of 
Kronecker graphs. 

Degree distribution 

The next few theorems prove that several distributions of interest are multino- 
mial for our Kronecker graph model. This is important, because a careful choice 
of the initial graph $K_1$ makes the resulting multinomial distribution behave like 
a power law or discrete Gaussian exponential (DGX) distribution [Bi et al. 2001, 
Clauset et al. 2007]. 

Theorem 9.5 (Multinomial degree distribution). Kronecker graphs have 
multinomial degree distributions, for both in- and out-degrees. 

Proof. Let the initiator $K_1$ have the degree sequence $d_1, d_2, \ldots, d_{N_1}$. Kronecker 
multiplication of a node with degree $d$ expands it into $N_1$ nodes, with the 
corresponding degrees being $d \times d_1, d \times d_2, \ldots, d \times d_{N_1}$. After Kronecker powering, 
the degree of each node in graph $K_k$ is of the form $d_{i_1} \times d_{i_2} \times \cdots \times d_{i_k}$, with 
$i_1, i_2, \ldots, i_k \in (1 \ldots N_1)$, and there is one node for each ordered combination. 
This gives us the multinomial distribution on the degrees of $K_k$. So, graph $K_k$ will 
have multinomial degree distribution where the "events" (degrees) of the 
distribution will be combinations of degree products: $d_1^{i_1} d_2^{i_2} \cdots d_{N_1}^{i_{N_1}}$ (where $\sum_{j=1}^{N_1} i_j = k$) 
and event (degree) probabilities will be proportional to $\binom{k}{i_1 i_2 \ldots i_{N_1}}$. Note also that 
this is equivalent to noticing that the degrees of nodes in $K_k$ can be expressed as 
the $k$th Kronecker power of the vector $(d_1, d_2, \ldots, d_{N_1})$. □ 
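
The closing remark of the proof, that the degrees of $K_k$ are the $k$th Kronecker power of the initiator's degree vector, is easy to check numerically. The sketch below is illustrative only: it assumes the 3-node chain initiator with self-loops, takes a node's degree to be its row sum (self-loop included), and uses hypothetical helper names.

```python
from collections import Counter


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(k1, k):
    """k-th Kronecker power, using K_k = K_{k-1} (x) K_1."""
    result = k1
    for _ in range(k - 1):
        result = kron(result, k1)
    return result


def vector_kron_power(d, k):
    """k-th Kronecker power of a vector d."""
    result = [1]
    for _ in range(k):
        result = [x * y for x in result for y in d]
    return result


# Assumed initiator: 3-node chain with a self-loop on every node.
CHAIN = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]

d1 = [sum(row) for row in CHAIN]                 # out-degrees of K_1
d3 = [sum(row) for row in kron_power(CHAIN, 3)]  # out-degrees of K_3

print(d3 == vector_kron_power(d1, 3))
print(sorted(Counter(d3).items()))               # (degree, number of nodes)
```

Because two of the three initiator degrees coincide here, the node counts per degree are binomial coefficients weighted by powers of two, which is the multinomial structure (and the large multiplicities) the theorem describes.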



Spectral properties 

Next we analyze the spectral properties of an adjacency matrix of a Kronecker 
graph. We show that both the distribution of eigenvalues and the distribution of 
component values of eigenvectors of the graph adjacency matrix follow multinomial 
distributions. 

Theorem 9.6 (Multinomial eigenvalue distribution). The Kronecker graph 
$K_k$ has a multinomial distribution for its eigenvalues. 

Proof. Let $K_1$ have the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{N_1}$. By properties of the 
Kronecker multiplication [Van Loan 2000, Langville & Stewart 2004], the eigenvalues 
of $K_k$ are the $k$th Kronecker power of the vector of eigenvalues of the initiator 
matrix, $(\lambda_1, \lambda_2, \ldots, \lambda_{N_1})^{[k]}$. As in Theorem 9.5, the eigenvalue distribution is a 
multinomial. □ 
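
The property of Kronecker multiplication invoked in the proof is the mixed-product rule. As a short worked illustration (the choice of initiator, the 3-node chain with self-loops, is an assumption made here for concreteness), an eigenpair of the product is obtained from eigenpairs of the factors:

```latex
% Mixed-product property applied to eigenpairs (u, \lambda) and (w, \mu) of K_1:
\[
K_1 u = \lambda u,\quad K_1 w = \mu w
\;\Longrightarrow\;
(K_1 \otimes K_1)(u \otimes w) = (K_1 u) \otimes (K_1 w) = \lambda\mu\,(u \otimes w).
\]
% Worked example, assuming K_1 is the 3-node chain with a self-loop on every node:
\[
\operatorname{spec}(K_1) = \{\,1+\sqrt{2},\; 1,\; 1-\sqrt{2}\,\},
\]
\[
\operatorname{spec}(K_2) = \{\,3+2\sqrt{2},\; 1+\sqrt{2}\ (\times 2),\; 1,\; -1\ (\times 2),\;
1-\sqrt{2}\ (\times 2),\; 3-2\sqrt{2}\,\}.
\]
```

The nine eigenvalues of $K_2$ are exactly the pairwise products of the three eigenvalues of $K_1$, and they sum to 9, the trace of $K_2$. The repeated values are the multiplicities that give the distribution its multinomial shape.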

A similar argument using properties of Kronecker matrix multiplication shows 
the following. 

Theorem 9.7 (Multinomial eigenvector distribution). The components of 
each eigenvector of the Kronecker graph $K_k$ follow a multinomial distribution. 

Proof. Let $K_1$ have the eigenvectors $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{N_1}$. By properties of the Kronecker 
multiplication [Van Loan 2000, Langville & Stewart 2004], the eigenvectors of $K_k$ 
are given by the $k$th Kronecker power of the vector $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{N_1})$, which gives a 
multinomial distribution for the components of each eigenvector in $K_k$. □ 

We have just covered several of the static graph patterns. Notice that the 
proofs were direct consequences of the Kronecker multiplication properties. 

Figure 9.4. Graph adjacency matrices. 

Dark parts represent connected (filled with ones) and white parts 
represent empty (filled with zeros) parts of the adjacency matrix. (a) When 
$G$ is disconnected, Kronecker multiplication with any matrix $H$ will 
result in $G \otimes H$ being disconnected. (b) Adjacency matrix of a connected 
bipartite graph $G$ with node partitions $A$ and $B$. (c) Adjacency matrix 
of a connected bipartite graph $H$ with node partitions $C$ and $D$. (d) 
Kronecker product of two bipartite graphs $G$ and $H$. (e) After 
rearranging the adjacency matrix $G \otimes H$, we clearly see that the resulting 
graph is disconnected. 



Connectivity of Kronecker graphs 

We now present a series of results on the connectivity of Kronecker graphs. We show, 
maybe a bit surprisingly, that even if a Kronecker initiator graph is connected, its 
Kronecker power can, in fact, be disconnected. 

Lemma 9.8. If at least one of $G$ and $H$ is a disconnected graph, then $G \otimes H$ is 
also disconnected. 

Proof. Without loss of generality, we can assume that $G$ has two connected 
components, while $H$ is connected. Figure 9.4(a) illustrates the corresponding adjacency 
matrix for $G$. Using the notation from Observation 1, let graph $G$ have nodes 
$X_1, \ldots, X_n$, where nodes $\{X_1, \ldots, X_r\}$ and $\{X_{r+1}, \ldots, X_n\}$ form the two connected 
components. Now, note that $(X_{ij}, X_{kl}) \notin G \otimes H$ for $i \in \{1, \ldots, r\}$, $k \in \{r+1, \ldots, n\}$, 
and for all $j, l$. This follows directly from Observation 1 as $(X_i, X_k)$ are not edges 
in $G$. Thus, $G \otimes H$ must have at least two connected components. □ 

Actually it turns out that both $G$ and $H$ can be connected while $G \otimes H$ is 
disconnected. The following theorem analyzes this case. 

Theorem 9.9. If both $G$ and $H$ are connected but bipartite, then $G \otimes H$ is 
disconnected, and each of the two connected components is again bipartite. 

Proof. Without loss of generality, let $G$ be bipartite with two partitions $A = 
\{X_1, \ldots, X_r\}$ and $B = \{X_{r+1}, \ldots, X_n\}$, where edges exist only between the 
partitions and no edges exist inside the partition: $(X_i, X_k) \notin G$ for $i, k \in A$ or 
$i, k \in B$. Similarly, let $H$ also be bipartite with two partitions $C = \{X_1, \ldots, X_s\}$ 
and $D = \{X_{s+1}, \ldots, X_m\}$. Figures 9.4(b) and 9.4(c) illustrate the structure of the 
corresponding adjacency matrices. 

Now, there will be two connected components in $G \otimes H$: the first component 
will be composed of nodes $\{X_{ij}\} \in G \otimes H$, where $(i \in A, j \in D)$ or $(i \in B, j \in C)$. 
And similarly, the second component will be composed of nodes $\{X_{ij}\}$, where 
$(i \in A, j \in C)$ or $(i \in B, j \in D)$. Basically, there exist edges between node 
sets $(A, D)$ and $(B, C)$, and similarly between $(A, C)$ and $(B, D)$ but not across 
the sets. To see this, we have to analyze the cases using Observation 1. For 
example, in $G \otimes H$ there exist edges between nodes $(A, C)$ and $(B, D)$ as there 
exist edges $(i, k) \in G$ for $i \in A, k \in B$, and $(j, l) \in H$ for $j \in C$ and $l \in D$. 
The same is true for nodes $(A, D)$ and $(B, C)$. However, no edges cross the 
two sets, e.g., nodes from $(A, D)$ do not link to $(A, C)$, as there are no edges 
between nodes in $A$ (since $G$ is bipartite). See Figures 9.4(d) and 9.4(e) for a visual 
proof. □ 
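
A small numerical check makes the theorem concrete. The sketch below is an illustration under stated assumptions: $G = H$ is the 3-node chain without self-loops (connected and bipartite), adjacency matrices are symmetric lists of rows, and `kron` and `count_components` are hypothetical helper names. The same chain with self-loops added is included for contrast.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def count_components(m):
    """Number of connected components of the undirected graph whose
    (symmetric) adjacency matrix is m."""
    n = len(m)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        stack = [start]
        while stack:  # depth-first search from `start`
            u = stack.pop()
            for v in range(n):
                if m[u][v] and not seen[v]:
                    seen[v] = True
                    stack.append(v)
    return components


# Connected bipartite graph: 3-node chain without self-loops.
BIPARTITE_CHAIN = [[0, 1, 0],
                   [1, 0, 1],
                   [0, 1, 0]]

# Same chain with a self-loop on every node (no longer bipartite).
LOOPED_CHAIN = [[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]]

print(count_components(kron(BIPARTITE_CHAIN, BIPARTITE_CHAIN)))
print(count_components(kron(LOOPED_CHAIN, LOOPED_CHAIN)))
```

In the bipartite case the nine product nodes split into a 5-node component (both coordinates on the same side of their partitions) and a 4-node component (coordinates on opposite sides), matching the two node sets named in the proof.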

Note that bipartite graphs are triangle free and have no self-loops. Stars, 
chains, trees, and cycles of even length are all examples of bipartite graphs. In 
order to ensure that $K_k$ is connected, for the remainder of the chapter we focus on 
initiator graphs $K_1$ with self-loops on all of the vertices. 



Temporal properties of Kronecker graphs 

We continue with the analysis of temporal patterns of evolution of Kronecker graphs: 
the densification power law and shrinking/stabilizing diameter (see, for instance, 
[Leskovec et al. 2005b, Leskovec et al. 2007a]). 

Theorem 9.10 (Densification power law). Kronecker graphs follow the 
densification power law (DPL) with densification exponent $a = \log(M_1)/\log(N_1)$. 

Proof. Since the $k$th Kronecker power $K_k$ has $N_k = N_1^k$ nodes and $M_k = M_1^k$ 
edges, it satisfies $M_k = N_k^a$, where $a = \log(M_1)/\log(N_1)$. The crucial point is that 
this exponent $a$ is independent of $k$, and hence the sequence of Kronecker powers 
follows an exact version of the densification power law. □ 
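
As a worked instance of the theorem, take the 3-node chain initiator with self-loops shown in Figure 9.1(d), and assume that every nonzero entry of the adjacency matrix (self-loops included) counts as one edge:

```latex
\[
N_1 = 3,\qquad M_1 = 7,\qquad a = \frac{\log 7}{\log 3} \approx 1.771,
\]
\[
M_k = 7^k = \left(3^k\right)^{\log 7/\log 3} = N_k^{\,a}
\qquad\text{for every } k \ge 1 .
\]
```

For example, $K_2$ has $9$ nodes and $49$ nonzero entries, and $9^{1.771} \approx 49$. Since $a > 1$, the average degree $M_k / N_k = (7/3)^k$ grows with $k$, which is the densification behavior described in Section 9.2.1.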

We now show how the Kronecker product also preserves the property of 
constant diameter, a crucial ingredient for matching the diameter properties of many 
real-world network data sets. In order to establish this, we will assume that the 
initiator graph $K_1$ has a self-loop on every node. Otherwise, its Kronecker powers 
may be disconnected. 

Lemma 9.11. If $G$ and $H$ each have diameter at most $D$ and each has a self-loop 
on every node, then the Kronecker graph $G \otimes H$ also has diameter at most $D$. 

Proof. Each node in $G \otimes H$ can be represented as an ordered pair $(v, w)$, with 
$v$ a node of $G$ and $w$ a node of $H$, and with an edge joining $(v, w)$ and $(x, y)$ 
precisely when $(v, x)$ is an edge of $G$ and $(w, y)$ is an edge of $H$. (Note this is 
exactly Observation 1.) Now, for an arbitrary pair of nodes $(v, w)$ and $(v', w')$, we 
must show that there is a path of length at most $D$ connecting them. Since $G$ has 
diameter at most $D$, there is a path $v = v_1, v_2, \ldots, v_r = v'$, where $r \le D$. If $r < D$, 
we can convert this into a path $v = v_1, v_2, \ldots, v_D = v'$ of length exactly $D$ by 
simply repeating $v'$ at the end for $D - r$ times (each repetition traverses the 
self-loop at $v'$). By an analogous argument, we have 
a path $w = w_1, w_2, \ldots, w_D = w'$. Now by the definition of the Kronecker product, 
there is an edge joining $(v_i, w_i)$ and $(v_{i+1}, w_{i+1})$ for all $1 \le i \le D - 1$, and so 
$(v, w) = (v_1, w_1), (v_2, w_2), \ldots, (v_D, w_D) = (v', w')$ is a path of length $D$ connecting 
$(v, w)$ to $(v', w')$, as required. □ 

Theorem 9.12. If $K_1$ has diameter $D$ and a self-loop on every node, then for 
every $k$, the graph $K_k$ also has diameter $D$. 

Proof. This follows directly from the previous lemma, combined with induction on 
k. □ 
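
The constant-diameter property can be confirmed by brute force for small powers. This is a sketch under stated assumptions: the initiator is the 3-node chain with self-loops (diameter 2), matrices are lists of rows, and `kron`, `kron_power`, and `diameter` are hypothetical helper names.

```python
from collections import deque


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(k1, k):
    """k-th Kronecker power, using K_k = K_{k-1} (x) K_1."""
    result = k1
    for _ in range(k - 1):
        result = kron(result, k1)
    return result


def diameter(m):
    """Largest shortest-path distance, via one breadth-first search per node."""
    n = len(m)
    best = 0
    for source in range(n):
        dist = [-1] * n
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if m[u][v] and dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            raise ValueError("graph is not connected")
        best = max(best, max(dist))
    return best


# Assumed initiator: 3-node chain with a self-loop on every node.
CHAIN = [[1, 1, 0],
         [1, 1, 1],
         [0, 1, 1]]

for k in range(1, 4):
    print(k, diameter(kron_power(CHAIN, k)))
```

The reason the diameter does not grow is visible in the proof of Lemma 9.11: with self-loops, a product node can "wait" in one coordinate while the other coordinate moves, so the distance between two product nodes is the maximum of the coordinate-wise distances rather than their sum.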

Define the $q$-effective diameter as the minimum $D^*$ such that, for a $q$ fraction 
of the reachable node pairs, the path length is at most $D^*$. The $q$-effective diameter 
is a more robust quantity than the diameter, the latter being prone to the effects of 
degenerate structures in the graph (e.g., very long chains). However, the $q$-effective 
diameter and diameter tend to exhibit qualitatively similar behavior. For reporting 
results in subsequent sections, we will generally consider the $q$-effective diameter 
with $q = 0.9$ and refer to this simply as the effective diameter. 

Theorem 9.13 (Effective diameter). If $K_1$ has diameter $D$ and a self-loop on 
every node, then for every $q$, the $q$-effective diameter of $K_k$ converges to $D$ (from 
below) as $k$ increases. 

Proof. To prove this, it is sufficient to show that for two randomly selected nodes 
of $K_k$, the probability that their distance is $D$ converges to 1 as $k$ goes to infinity. 

We establish this as follows. Each node in $K_k$ can be represented as an 
ordered sequence of $k$ nodes from $K_1$, and we can view the random selection of 
a node in $K_k$ as a sequence of $k$ independent random node selections from $K_1$. 
Suppose that $v = (v_1, \ldots, v_k)$ and $w = (w_1, \ldots, w_k)$ are two such randomly selected 
nodes from $K_k$. Now, if $x$ and $y$ are two nodes in $K_1$ at distance $D$ (such a 
pair $(x, y)$ exists since $K_1$ has diameter $D$), then with probability at least 
$1 - (1 - \frac{1}{N_1^2})^k$, there is some index $j$ for which $\{v_j, w_j\} = \{x, y\}$. If there is such an index, 
then the distance between $v$ and $w$ is $D$. As the expression $1 - (1 - \frac{1}{N_1^2})^k$ 
converges to 1 as $k$ increases, it follows that the $q$-effective diameter is converging 
to $D$. □ 

9.3.3 Stochastic Kronecker graphs 

While the Kronecker power construction discussed so far yields graphs with a range
of desired properties, its discrete nature produces "staircase effects" in the degrees
and spectral quantities, simply because individual values have large multiplicities.
For example, the degree distribution, the distribution of eigenvalues of the graph
adjacency matrix, and the distribution of the principal eigenvector components (i.e.,
the "network value") are all affected. These quantities are multinomially distributed,
leading to individual values with large multiplicities. Figure 9.5 illustrates
the staircase effect.

Here we propose a stochastic version of Kronecker graphs that eliminates this
effect. There are many possible ways one could introduce stochasticity into the
Kronecker graph model. Before presenting the proposed model, we describe two
simple ways of adding randomness to Kronecker graphs and explain why they
do not work.

Probably the simplest (but wrong) idea is to generate a large deterministic
Kronecker graph $K_k$ and then uniformly at random flip some edges, i.e., uniformly
at random select entries of the graph adjacency matrix and flip them ($1 \to 0$, $0 \to 1$).
However, this will not work, as it will essentially superimpose an Erdős-Rényi
random graph, which would, for example, corrupt the degree distribution: real
networks usually have heavy-tailed degree distributions, while random graphs have
binomial degree distributions. A second idea could be to allow a weighted initiator
matrix, i.e., values of entries of $K_1$ are not restricted to values $\{0, 1\}$ but rather can
be any nonnegative real number. Using such a $K_1$, one would generate $K_k$ and then
threshold the $K_k$ matrix to obtain a binary adjacency matrix $K$, i.e., for a chosen
value of $\epsilon$, set $K[i,j] = 1$ if $K_k[i,j] > \epsilon$, else $K[i,j] = 0$. This mechanism would
selectively remove edges, and low degree nodes would get isolated first.

Now we define the stochastic Kronecker graph model, which overcomes the 
above issues. A more natural way to introduce stochasticity to Kronecker graphs is 
to relax the assumption that entries of the initiator matrix take only binary values. 
Instead, we allow entries of the initiator to take values on the interval [0, 1]. Now 
each entry of the initiator matrix encodes the probability of that particular edge 
appearing. We then Kronecker-power such an initiator matrix to obtain a large 
stochastic adjacency matrix, where again each entry of the large matrix gives the 
probability of that particular edge appearing in a big graph. Such a stochastic 
adjacency matrix defines a probability distribution over all graphs. To obtain a 
graph, we simply sample an instance from this distribution by sampling individual 
edges, where each edge appears independently with probability given by the entry 
of the large stochastic adjacency matrix. More formally, we define the Stochastic 
Kronecker graph as follows.

Figure 9.5. The "staircase" effect.

Top: Kronecker initiator $K_1$. Middle: degree distribution of $K_6$ (6th
Kronecker power of $K_1$). Bottom: network value of $K_6$ (6th Kronecker
power of $K_1$). Notice the nonsmoothness of the curves.

Definition 9.14 (Stochastic Kronecker graph). Let $\mathcal{P}_1$ be an $N \times N$ probability
matrix; the value $\theta_{ij} \in \mathcal{P}_1$ denotes the probability that edge $(i, j)$ is present,
$\theta_{ij} \in [0, 1]$.

Then the $k$th Kronecker power $\mathcal{P}_1^{[k]} = \mathcal{P}_k$, where each entry $p_{uv} \in \mathcal{P}_k$ encodes
the probability of an edge $(u, v)$.

To obtain a graph, an instance (or realization) $K = R(\mathcal{P}_k)$, we include edge
$(u, v)$ in $K$ with probability $p_{uv}$, $p_{uv} \in \mathcal{P}_k$.

First, note that the sum of the entries of $\mathcal{P}_1$, $\sum_{ij} \theta_{ij}$, can be greater than 1.
Second, notice that in principle it takes $O(N^{2k})$ time to generate an instance $K$ of
a stochastic Kronecker graph from the probability matrix $\mathcal{P}_k$. This means the time
to get a realization $K$ is quadratic in the number of nodes $N^k$ of $\mathcal{P}_k$, as one has to
flip a coin for each possible edge in the graph. Later we show how to generate
stochastic Kronecker graphs much faster, in time linear in the expected number of
edges in $\mathcal{P}_k$.
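The naive procedure described above can be sketched as follows. This is an illustration only, usable for small $N^k$; the function names and the initiator values are assumptions of this example, not values from the text.

```python
import numpy as np

def kronecker_power(P1: np.ndarray, k: int) -> np.ndarray:
    """Explicit k-th Kronecker power of the initiator (small N**k only)."""
    Pk = P1
    for _ in range(k - 1):
        Pk = np.kron(Pk, P1)
    return Pk

def sample_naive(P1: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Flip one independent biased coin per entry of P_k: O(N**(2k)) work."""
    Pk = kronecker_power(P1, k)
    return (rng.random(Pk.shape) < Pk).astype(np.uint8)

# Illustrative 2 x 2 initiator; note that its entries sum to 2, not 1.
P1 = np.array([[0.9, 0.5],
               [0.5, 0.1]])
K = sample_naive(P1, k=3, rng=np.random.default_rng(0))
expected_edges = kronecker_power(P1, 3).sum()   # equals P1.sum() ** 3
print(K.shape, int(K.sum()), expected_edges)
```

The sum of the entries of the Kronecker power equals the $k$th power of the sum of the initiator entries, which is why the initiator alone determines the expected number of edges.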



Probability of an edge 

For the size of graphs we aim to model and generate here, taking $\mathcal{P}_1$ (or $K_1$) and
then explicitly performing the Kronecker product of the initiator matrix is infeasible.
The reason is that $\mathcal{P}_1$ is usually dense, so $\mathcal{P}_k$ is also dense and one cannot explicitly
store it in memory to directly iterate the Kronecker product. However, due to the
structure of Kronecker multiplication, one can easily compute the probability of an
edge in $\mathcal{P}_k$.

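The probability $p_{uv}$ of an edge $(u, v)$ occurring in the $k$th Kronecker power
$\mathcal{P} = \mathcal{P}_k$ can be calculated in $O(k)$ time as follows:

$$p_{uv} = \prod_{i=0}^{k-1} \mathcal{P}_1\left[\left\lfloor \frac{u-1}{N^i} \right\rfloor (\mathrm{mod}\ N) + 1,\ \left\lfloor \frac{v-1}{N^i} \right\rfloor (\mathrm{mod}\ N) + 1\right]. \qquad (9.2)$$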



The equation imitates recursive descent into the matrix $\mathcal{P}$, where at every
level $i$ the appropriate entry of $\mathcal{P}_1$ is chosen. Since $\mathcal{P}$ has $N^k$ rows and columns, it
takes $O(k \log N)$ to evaluate the equation. Refer to Figure 9.6 for the illustration
of the recursive structure of $\mathcal{P}$.
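Equation (9.2) translates almost directly into code. The sketch below uses 1-based node indices as in the equation and compares the result against an explicitly formed Kronecker power; the function name `edge_probability` and the initiator values are assumptions of this example.

```python
import numpy as np

def edge_probability(P1: np.ndarray, k: int, u: int, v: int) -> float:
    """Evaluate equation (9.2) for 1-based node indices u, v in 1..N**k."""
    N = P1.shape[0]
    u0, v0 = u - 1, v - 1
    p = 1.0
    for _ in range(k):
        # 0-based equivalent of floor((u-1)/N**i) mod N + 1 at level i
        p *= P1[u0 % N, v0 % N]
        u0 //= N
        v0 //= N
    return p

# Illustrative initiator; compare with the explicit 3rd Kronecker power.
P1 = np.array([[0.9, 0.5],
               [0.4, 0.1]])
Pk = np.kron(np.kron(P1, P1), P1)
max_error = max(abs(edge_probability(P1, 3, u, v) - Pk[u - 1, v - 1])
                for u in range(1, 9) for v in range(1, 9))
print(max_error)
```

Only the $N \times N$ initiator is stored, so the probability of any single edge is available without materializing the $N^k \times N^k$ matrix.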



9.3.4 Additional properties of Kronecker graphs 

Stochastic Kronecker graphs with initiator matrix of size $N = 2$ were studied by
Mahdian and Xu [Mahdian & Xu 2007]. The authors showed a phase transition for
the emergence of the giant component and another phase transition for connectivity,
and they proved that such graphs have constant diameters beyond the connectivity
threshold, but are not searchable using a decentralized algorithm [Kleinberg 1999].

A general overview of the Kronecker product is given in [Imrich & Klavžar 2000],
and properties of Kronecker graphs related to graph minors, planarity, cut vertex,
and cut edge have been explored in [Bottreau & Métivier 1998]. Moreover, recently
Tsourakakis [Tsourakakis 2008] gave a closed form expression for the number of
triangles in a Kronecker graph that depends on the eigenvalues of the initiator
graph $K_1$.

Figure 9.6. Stochastic Kronecker initiator.

The initiator matrix $\mathcal{P}_1$ and the corresponding 2nd Kronecker power
$\mathcal{P}_2$. Notice the recursive nature of the Kronecker product, with edge
probabilities in $\mathcal{P}_2$ simply being products of entries of $\mathcal{P}_1$.

9.3.5 Two interpretations of Kronecker graphs 

Next, we present two natural interpretations of the generative process behind the 
Kronecker graphs that go beyond the purely mathematical construction of Kro- 
necker graphs as introduced so far. 

We already mentioned the first interpretation when we first defined Kronecker
graphs. One intuition is that networks are hierarchically organized into communities
(clusters). Communities then grow recursively, creating miniature copies
of themselves. Figure 9.1 depicts the process of the recursive community expansion.
In fact, several researchers have argued that real networks are hierarchically
organized (see, for instance, [Ravasz et al. 2002, Ravasz & Barabási 2003]),
and algorithms to extract the network hierarchical structure have also been
developed [Sales-Pardo et al. 2007, Clauset et al. 2009]. Moreover, especially web graphs
[Dill et al. 2002, Dorogovtsev et al. 2002, Crovella & Bestavros 1997] and biological
networks [Ravasz & Barabási 2003] were found to be self-similar and "fractal."

The second intuition comes from viewing every node of $\mathcal{P}_k$ as being described
with an ordered sequence of $k$ nodes from $\mathcal{P}_1$. (This is similar to Observation 1 and
the proof of Theorem 9.13.)

Let's label nodes of the initiator matrix $\mathcal{P}_1$ as $u_1, \ldots, u_N$, and nodes of $\mathcal{P}_k$ as
$v_1, \ldots, v_{N^k}$. Then every node $v_i$ of $\mathcal{P}_k$ is described with a sequence
$(v_i(1), \ldots, v_i(k))$ of node labels of $\mathcal{P}_1$, where $v_i(l) \in \{u_1, \ldots, u_N\}$. Similarly,
consider also a second node $v_j$ with the label sequence $(v_j(1), \ldots, v_j(k))$. Then the
probability $p_e$ of an edge $(v_i, v_j)$ in $\mathcal{P}_k$ is exactly

$$p_e(v_i, v_j) = \mathcal{P}_k[v_i, v_j] = \prod_{l=1}^{k} \mathcal{P}_1[v_i(l), v_j(l)].$$

(Note this is exactly equation (9.2).)

Now one can look at the description sequence of node $v_i$ as a $k$-dimensional
vector of attribute values $(v_i(1), \ldots, v_i(k))$. Then $p_e(v_i, v_j)$ is exactly the
coordinate-wise product of appropriate entries of $\mathcal{P}_1$, where the node description
sequence selects which entries of $\mathcal{P}_1$ to multiply. Thus, the $\mathcal{P}_1$ matrix can be thought
of as the attribute similarity matrix; i.e., it encodes the probability of linking given
that two nodes agree/disagree on the attribute value. Then the probability of an edge is
simply a product of individual attribute similarities over the $k$ $N$-valued attributes
that describe each of the two nodes.

This intuition gives us a very natural interpretation of stochastic Kronecker
graphs. Each node is described by a sequence of categorical attribute values or
features, and then the probability of two nodes linking depends on the product of
individual attribute similarities. This way, Kronecker graphs can effectively model
homophily (nodes with similar attribute values are more likely to link) by $\mathcal{P}_1$ having
high-value entries on the diagonal, or heterophily (nodes that differ are more likely
to link) by $\mathcal{P}_1$ having high entries off the diagonal.

Figure 9.6 shows an example. Let's label nodes of $\mathcal{P}_1$ as $u_1, u_2$ as in
Figure 9.6(a). Then every node of $\mathcal{P}_k$ is described with an ordered sequence of $k$
binary attributes. For example, Figure 9.6(b) shows an instance for $k = 2$ where
node $v_2$ of $\mathcal{P}_2$ is described by $(u_1, u_2)$, and similarly $v_3$ by $(u_2, u_1)$. Then as
shown in Figure 9.6(b), the probability of edge $p_e(v_2, v_3) = b \cdot c$, which is exactly
$\mathcal{P}_1[u_2, u_1] \cdot \mathcal{P}_1[u_1, u_2] = b \cdot c$, the product of entries of $\mathcal{P}_1$, where the corresponding
elements of the description of nodes $v_2$ and $v_3$ act as selectors of which entries of
$\mathcal{P}_1$ to multiply.

Figure 9.6(c) further illustrates the recursive nature of Kronecker graphs. One
can see the Kronecker product as recursive descent into the big adjacency matrix
where at each stage one of the entries or blocks is chosen. For example, to get
to entry $(v_2, v_3)$, one first needs to dive into quadrant $b$ followed by quadrant $c$.
This intuition will help us in Section 9.3.6 to devise a fast algorithm for generating
Kronecker graphs.
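Since the figure is not reproduced here, the following worked expansion may help. It assumes the naming convention in which the $2 \times 2$ initiator has entries $a, b$ in its first row and $c, d$ in its second row; this layout is inferred from the surrounding text rather than taken from the figure itself.

```latex
\mathcal{P}_1 =
\begin{pmatrix} a & b \\ c & d \end{pmatrix},
\qquad
\mathcal{P}_2 = \mathcal{P}_1 \otimes \mathcal{P}_1 =
\begin{pmatrix}
a\,a & a\,b & b\,a & b\,b \\
a\,c & a\,d & b\,c & b\,d \\
c\,a & c\,b & d\,a & d\,b \\
c\,c & c\,d & d\,c & d\,d
\end{pmatrix}
```

The entry in row 2, column 3 is $b \cdot c$: the top-right quadrant contributes the factor $b$, and within that quadrant the bottom-left position contributes the factor $c$, which matches the descent into quadrant $b$ followed by quadrant $c$.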

However, there are also two notes to make here. First, by using a single initiator
$\mathcal{P}_1$, we are implicitly assuming that there is one single and universal attribute
similarity matrix that holds across all $k$ $N$-ary attributes. One can easily relax
this assumption by taking a different initiator matrix for each attribute (initiator
matrices can even be of different sizes as attributes are of different arity), and then
Kronecker-multiplying them to obtain a large network. Here each initiator matrix
plays the role of attribute similarity matrix for that particular attribute.

For simplicity and convenience, we will work with a single initiator matrix,
but all our methods can be trivially extended to handle multiple initiator matrices.
Moreover, as we will see later in Section 9.6, even a single $2 \times 2$ initiator matrix seems
to be enough to capture large-scale statistical properties of real-world networks.

The second assumption is harder to relax. When describing every node $v_i$
with a sequence of attribute values, we are implicitly assuming that the values of
all attributes are uniformly distributed (have same proportions) and that every
node has a unique combination of attribute values. So, all possible combinations of
attribute values are taken. For example, node $v_1$ in a large matrix $\mathcal{P}_k$ has attribute
sequence $(u_1, u_1, \ldots, u_1)$, and $v_N$ has $(u_1, u_1, \ldots, u_1, u_N)$, while the "last" node
$v_{N^k}$ has attribute values $(u_N, u_N, \ldots, u_N)$. One can think of this as counting in
the $N$-ary number system, where node attribute descriptions range from 0 (i.e.,
"leftmost" node with attribute description $(u_1, u_1, \ldots, u_1)$) to $N^k - 1$ (i.e., "rightmost"
node attribute description $(u_N, u_N, \ldots, u_N)$).

A simple way to relax the above assumption is to take a larger initiator matrix
with a smaller number of parameters than the number of entries. This means
that multiple entries of $\mathcal{P}_1$ will share the same value (parameter). For example,
if attribute $u_1$ takes one value 66% of the time, and the other value 33% of the
time, then one can model this by taking a $3 \times 3$ initiator matrix with only four
parameters. Adopting the naming convention of Figure 9.6, we see that parameter
$a$ now occupies a $2 \times 2$ block, which then also makes $b$ and $c$ occupy $2 \times 1$ and $1 \times 2$
blocks, and $d$ a single cell. This way one gets a four-parameter model with uneven
feature value distribution.
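A minimal sketch of this parameter tying is shown below. The helper name `tied_initiator` and the numeric values are assumptions of this example; the block layout follows the description above ($a$ in a $2 \times 2$ block, $b$ in $2 \times 1$, $c$ in $1 \times 2$, $d$ in a single cell).

```python
import numpy as np

def tied_initiator(a: float, b: float, c: float, d: float) -> np.ndarray:
    """3 x 3 initiator with four parameters: the first attribute value is
    represented twice, so it occurs two thirds of the time."""
    base = np.array([[a, b],
                     [c, d]])
    groups = [0, 0, 1]          # rows/columns 0 and 1 share one attribute value
    return base[np.ix_(groups, groups)]

P1 = tied_initiator(0.9, 0.5, 0.4, 0.1)   # illustrative values
print(P1)
```

The resulting matrix has nine entries but only four distinct parameters, so the number of quantities to estimate does not grow with the larger initiator.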

We note that the view of Kronecker graphs in which every node is described
with a set of features and the initiator matrix encodes the probability of linking given
the attribute values of two nodes somewhat resembles the random dot product graph
model [Young & Scheinerman 2007, Nickel 2008]. The important difference here
is that we multiply individual linking probabilities, while in random dot product
graphs one takes the sum of individual probabilities, which seems somewhat less
natural.

9.3.6 Fast generation of stochastic Kronecker graphs 

The intuition for fast generation of stochastic Kronecker graphs comes from the
recursive nature of the Kronecker product and is closely related to the R-MAT graph
generator [Chakrabarti et al. 2004]. Generating a stochastic Kronecker graph $K$ on
$N$ nodes naively takes $O(N^2)$ time. Here we present a linear time $O(E)$ algorithm,
where $E$ is the (expected) number of edges in $K$.

Figure 9.6(c) shows the recursive nature of the Kronecker product. To "arrive"
at a particular edge $(v_i, v_j)$ of $\mathcal{P}_k$, one has to make a sequence of $k$ (in our case
$k = 2$) decisions among the entries of $\mathcal{P}_1$, multiply the chosen entries of $\mathcal{P}_1$, and
then place the edge $(v_i, v_j)$ with the obtained probability.

Instead of flipping one biased coin for each of the $O(N^2)$ possible edges (that is,
$O(N^{2k})$ coins when $N$ denotes the initiator size) to determine the edges, we
can place $E$ edges by directly simulating the recursion of the Kronecker product.
Basically, we recursively choose subregions of matrix $K$ with probability proportional
to $\theta_{ij}$, $\theta_{ij} \in \mathcal{P}_1$, until in $k$ steps we descend to a single cell of the big adjacency
matrix $K$ and place an edge. For example, for $(v_2, v_3)$ in Figure 9.6(c), we first have
to choose $b$ followed by $c$.

The probability of each individual edge of $\mathcal{P}_k$ follows a Bernoulli distribution,
as the edge occurrences are independent. By the Central Limit Theorem [Petrov 1995],
the number of edges in $\mathcal{P}_k$ tends to a normal distribution with mean
$(\sum_{i,j=1}^{N} \theta_{ij})^k = M^k$, where $\theta_{ij} \in \mathcal{P}_1$ and $M = \sum_{ij} \theta_{ij}$. So, given a stochastic
initiator matrix $\mathcal{P}_1$, we first sample the expected number of edges $E$ in $\mathcal{P}_k$. Then we
place $E$ edges in a graph $K$ by applying the recursive descent for $k$ steps, where at
each step we choose entry $(i, j)$ with probability $\theta_{ij}/M$. Since we add $E = M^k$ edges,
the probability that edge $(v_i, v_j)$ appears in $K$ is exactly $\mathcal{P}_k[v_i, v_j]$. In stochastic
Kronecker graphs, the initiator matrix encodes both the total number of edges in a
graph and their structure: $\sum_{ij} \theta_{ij}$ encodes the number of edges in the graph, while the
proportions (ratios) of values $\theta_{ij}$ define how many edges each part of a graph
adjacency matrix will contain.

In practice, it can happen that more than one edge lands in the same $(v_i, v_j)$
entry of the big adjacency matrix $K$. If an edge lands in an already occupied cell, we
insert it again. Even though values of $\mathcal{P}_1$ are usually skewed, adjacency matrices of
real networks are so sparse that this is not really a problem in practice. Empirically
we note that around 1% of edges collide.
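The recursive-descent procedure of this section can be sketched as follows. This is not the authors' implementation: the function name `sample_fast`, the variance used for the normal draw (the variance of a sum of independent Bernoulli edges), and the reading of "insert it again" as redrawing a colliding edge are all assumptions made for this example.

```python
import numpy as np

def sample_fast(P1: np.ndarray, k: int, rng: np.random.Generator) -> set:
    """Place edges by k-step recursive descent instead of N**(2k) coin flips."""
    N = P1.shape[0]
    M = P1.sum()
    mean = M ** k
    # Assumed variance: sum(p) - sum(p**2), with sum(p**2) = (sum theta**2)**k.
    var = mean - (P1 ** 2).sum() ** k
    E = max(0, int(round(rng.normal(mean, np.sqrt(max(var, 0.0))))))
    E = min(E, int(np.count_nonzero(P1)) ** k)     # cannot exceed reachable cells
    cell_prob = (P1 / M).ravel()                   # entry (i, j) w.p. theta_ij / M
    weights = N ** np.arange(k - 1, -1, -1)        # first choice = top-level block
    edges = set()
    while len(edges) < E:
        need = E - len(edges)
        cells = rng.choice(N * N, size=(need, k), p=cell_prob)
        u = (cells // N) @ weights                 # row index in the big matrix
        v = (cells % N) @ weights                  # column index in the big matrix
        edges.update(zip(u.tolist(), v.tolist()))  # collisions are redrawn
    return edges

P1 = np.array([[0.9, 0.5],
               [0.5, 0.1]])                        # illustrative values
edges = sample_fast(P1, k=8, rng=np.random.default_rng(0))
print(len(edges))
```

Each edge costs $k$ draws from the $N^2$ initiator cells, so the total work is proportional to $kE$ rather than to the number of entries of the big matrix.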

9.3.7 Observations and connections 

Next, we describe several observations about the properties of Kronecker graphs 
and make connections to other network models. 

• Bipartite graphs: Kronecker graphs can naturally model bipartite graphs.
Instead of starting with a square $N \times N$ initiator matrix, one can choose an
arbitrary $N \times M_1$ initiator matrix, where rows define the "left" and columns
the "right" side of the bipartite graph. Kronecker multiplication will then
generate bipartite graphs with partition sizes $N^k$ and $M_1^k$.

• Graph distributions: $\mathcal{P}_k$ defines a distribution over all graphs as it encodes
the probability of all possible $N^{2k}$ edges appearing in a graph by using an
exponentially smaller number of parameters (just $N^2$). As we will later see,
even a very small number of parameters, e.g., 4 ($2 \times 2$ initiator matrix) or 9
($3 \times 3$ initiator), is enough to accurately model the structure of large networks.

• Extension of Erdős-Rényi random graph model: Stochastic Kronecker graphs
represent an extension of Erdős-Rényi random graphs [Erdős & Rényi 1960].
If one takes $\mathcal{P}_1 = [\theta_{ij}]$, where every $\theta_{ij} = p$, then we obtain exactly the
Erdős-Rényi model of random graphs $G_{n,p}$, where every edge appears independently
with probability $p$.

• Relation to the R-MAT model: The recursive nature of stochastic Kronecker
graphs makes them related to the R-MAT generator [Chakrabarti et al. 2004].
The difference between the two models is that in R-MAT one needs to separately
specify the number of edges, while in stochastic Kronecker graphs the
initiator matrix $\mathcal{P}_1$ also encodes the number of edges in the graph. Section 9.3.6
built on this similarity to devise a fast algorithm for generating stochastic
Kronecker graphs.

• Densification: Similarly as with deterministic Kronecker graphs, the number
of nodes in a stochastic Kronecker graph grows as $N^k$, and the expected
number of edges grows as $(\sum_{ij} \theta_{ij})^k$. This means one would want to choose
values $\theta_{ij}$ of the initiator matrix $\mathcal{P}_1$ so that $\sum_{ij} \theta_{ij} > N$ in order for the
resulting network to densify.
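The densification condition in the last item follows from the two growth rates by eliminating $k$. The short derivation below uses $n_k$ and $e_k$ for the number of nodes and the expected number of edges after $k$ Kronecker powers, and $S$ for the sum of the initiator entries; these symbols are introduced here for illustration only.

```latex
n_k = N^k, \qquad e_k = S^k, \qquad S = \sum_{ij} \theta_{ij}
\;\;\Longrightarrow\;\;
k = \frac{\log n_k}{\log N}, \qquad
e_k = S^{\log n_k / \log N} = n_k^{\,\log S / \log N}
```

The expected number of edges is therefore a power of the number of nodes with exponent $\log S / \log N$, which exceeds 1 exactly when $S > N$.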

9.4 Simulations of Kronecker graphs 

Next we perform a set of simulation experiments to demonstrate the ability of
Kronecker graphs to match the patterns of real-world networks. In the next section,
we will tackle the problem of estimating the Kronecker graph model from real data,
i.e., finding the most likely initiator $\mathcal{P}_1$. Here we present simulation experiments
using Kronecker graphs to explore the parameter space and to compare properties
of Kronecker graphs to those found in large real networks.

9.4.1 Comparison to real graphs 

We observe two kinds of graph patterns — "static" and "temporal." As mentioned 
earlier, common static patterns include degree distribution, scree plot (eigenvalues 
of graph adjacency matrix versus rank), and distribution of components of the 
principal eigenvector of a graph adjacency matrix. Temporal patterns include the 
diameter over time and the densification power law. For the diameter computation, 
we use the effective diameter as defined in Section 9.2. 

For the purpose of this section, consider the following setting. Given a real
graph $G$, we want to find the Kronecker initiator that produces a qualitatively
similar graph. In principle, one could try choosing each of the $N^2$ parameters for
the matrix $\mathcal{P}_1$ separately. However, we reduce the number of parameters from $N^2$ to
just two: $\alpha$ and $\beta$. Let $K_1$ be the initiator matrix (binary, deterministic). Then we
create the corresponding stochastic initiator matrix $\mathcal{P}_1$ by replacing each "1" and
"0" of $K_1$ with $\alpha$ and $\beta$, respectively ($\beta \le \alpha$). The resulting probability matrices
maintain, with some random noise, the self-similar structure of the Kronecker
graphs in the previous section (which, for clarity, we call deterministic Kronecker
graphs). We defer the discussion of how to automatically estimate $\mathcal{P}_1$ from data $G$
to the next section.
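The two-parameter construction can be sketched in a few lines. The function name `stochastic_initiator` is hypothetical, and the star initiator below places self-loops on the diagonal, which is an assumption of this example; the values 0.41 and 0.11 are the ones quoted in the caption of Figure 9.7.

```python
import numpy as np

def stochastic_initiator(K1, alpha: float, beta: float) -> np.ndarray:
    """Replace each 1 of the binary initiator by alpha and each 0 by beta."""
    if not 0.0 <= beta <= alpha <= 1.0:
        raise ValueError("expected 0 <= beta <= alpha <= 1")
    return np.where(np.asarray(K1) == 1, alpha, beta)

# Star on four nodes (center + three satellites); diagonal self-loops assumed.
K1 = np.array([[1, 1, 1, 1],
               [1, 1, 0, 0],
               [1, 0, 1, 0],
               [1, 0, 0, 1]])
P1 = stochastic_initiator(K1, alpha=0.41, beta=0.11)
print(P1)
```

Setting $\beta = 0$ keeps the zero pattern of the deterministic initiator, while a small positive $\beta$ allows edges anywhere in the resulting graph.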

The data sets we use here are the following: 

• Cit-hep-th: This is a citation graph for high-energy physics theory research
papers from preprint archive ArXiv, with a total of $N = 29{,}555$ papers and
$E = 352{,}807$ citations [Gehrke et al. 2003]. We follow the citation graph's
evolution from January 1993 to April 2003, with one data point per month.

• As-RouteViews: We also analyze a static data set consisting of a single
snapshot of connectivity among Internet autonomous systems [RouteViews 1997]
from January 2000, with $N = 6474$ and $E = 26{,}467$.

Results are shown in Figure 9.7 for the Cit-hep-th graph, which evolves over
time. We show the plots of one static and one temporal pattern. We see that
the deterministic Kronecker model already to some degree captures the qualitative
structure of the degree distribution, as well as the temporal pattern represented by
the densification power law. However, the deterministic nature of this model results
in discrete behavior, as shown in the degree distribution plot for the deterministic
Kronecker graph of Figure 9.7. We see that the stochastic Kronecker graphs smooth
out these distributions, further matching the qualitative structure of the real data.

Figure 9.7. Citation network (Cit-hep-th).
(a) Degree (b) DPL
Patterns from the real graph (top row), the deterministic Kronecker
graph with $K_1$ being a star graph on four nodes (center + three satellites)
(middle row), and the stochastic Kronecker graph ($\alpha = 0.41$, $\beta = 0.11$,
bottom row). (a) is the PDF of degrees in the graph (log-log scale),
and (b) is the number of edges versus number of nodes over time (log-log
scale). Notice that the stochastic Kronecker graph qualitatively matches
all the patterns very well.

Similarly, Figure 9.8 shows plots for the static patterns in the autonomous
systems (As-RouteViews) graph. Recall that we analyze a single, static network
snapshot in this case. In addition to the degree distribution, we show a typical
plot [Chakrabarti et al. 2004] of the distribution of network values (principal
eigenvector components, sorted, versus rank). Notice that, again, the stochastic
Kronecker graph matches well the properties of the real graph.

Figure 9.8. Autonomous systems (As-RouteViews).
(a) Degree (b) "Network value"
Real (top) versus Kronecker (bottom). Column (a) shows the degree
distribution. Column (b) shows one more static pattern, the network value.
Notice that, again, the stochastic Kronecker graph matches well the
properties of the real graph.



9.4.2 Parameter space of Kronecker graphs 

Last, we present simulation experiments that investigate the parameter space of 
stochastic Kronecker graphs. 

First, in Figure 9.9, we show the ability of Kronecker graphs to generate
networks with increasing, constant, and decreasing/stabilizing effective diameter.
We start with a four-node chain initiator graph (shown in the top row of Figure 9.3),
setting each "1" of $K_1$ to $\alpha$ and each "0" to $\beta = 0$ to obtain $\mathcal{P}_1$ that we then use to
generate a growing sequence of graphs. We plot the effective diameter of each $R(\mathcal{P}_k)$
as we generate a sequence of growing graphs $R(\mathcal{P}_2), R(\mathcal{P}_3), \ldots, R(\mathcal{P}_{10})$. $R(\mathcal{P}_{10})$ has
exactly 1,048,576 nodes. Notice that stochastic Kronecker graphs are very flexible
models. When the generated graph is very sparse (low value of $\alpha$), we obtain graphs
with slowly increasing effective diameter (Figure 9.9 (top)). For intermediate values
of $\alpha$, we get graphs with constant diameter (Figure 9.9 (middle)) that, in our case,
also slowly densify with densification exponent $a = 1.05$. Lastly, we see an example
of a graph with shrinking/stabilizing effective diameter. Here we set $\alpha = 0.54$,
which results in a densification exponent of $a = 1.2$. Note that these observations
are not contradicting Theorem 9.11. Actually, these simulations agree well with the
analysis of [Mahdian & Xu 2007].

Figure 9.9. Effective diameter over time.

Time evolution of a 4-node chain initiator graph. Top: increasing diameter
($\alpha = 0.38$, $\beta = 0$). Middle: constant diameter ($\alpha = 0.43$, $\beta = 0$).
Bottom: decreasing diameter ($\alpha = 0.54$, $\beta = 0$). After each consecutive
Kronecker power, we measure the effective diameter.

Next, we examine the parameter space of a stochastic Kronecker graph for
which we choose a star on four nodes as an initiator graph and parameterize with $\alpha$
and $\beta$ as before. The initiator graph and the structure of the corresponding
(deterministic) Kronecker graph adjacency matrix are shown in the top row of Figure 9.3.

Figure 9.10 (top) shows the sharp transition in the fraction of the number of
nodes that belong to the largest weakly connected component as we fix $\beta = 0.15$
and slowly increase $\alpha$. Such phase transitions on the size of the largest connected
component also occur in Erdős-Rényi random graphs. Figure 9.10 (middle) further
explores this by plotting the fraction of nodes in the largest connected component
($N_c/N$) over the full parameter space. Notice the sharp transition between
disconnected (white area) and connected graphs (dark).

Last, Figure 9.10 (bottom) shows the effective diameter over the parameter
space ($\alpha$, $\beta$) for the four-node star initiator graph. Notice that when parameter
values are small, the effective diameter is small since the graph is disconnected and not
many pairs of nodes can be reached. The shape of the transition between low and high
diameter closely follows the shape of the emergence of the connected component.
Similarly, when parameter values are large, the graph is very dense and the diameter
is small. There is a narrow band in parameter space where we get graphs with
interesting diameters.
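The quantity $N_c/N$ plotted in these experiments can be computed with a union-find pass over the edge list, ignoring edge direction. The sketch below is one straightforward way to do this; the function name `largest_wcc_fraction` and the toy edge list are assumptions of this example.

```python
def largest_wcc_fraction(num_nodes, edges):
    """Fraction of nodes in the largest weakly connected component."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for u, v in edges:                         # direction is ignored
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    sizes = {}
    for x in range(num_nodes):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / num_nodes

# Toy directed graph with components {0, 1, 2} and {3, 4}.
fraction = largest_wcc_fraction(5, [(0, 1), (2, 1), (3, 4)])
print(fraction)
```

Sweeping $\alpha$ for a fixed $\beta$ and plotting this fraction for each sampled graph would reproduce the kind of curve described for Figure 9.10 (top).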

9.5 Kronecker graph model estimation 

In previous sections, we investigated various properties of networks generated by the 
(stochastic) Kronecker graphs model. Many of these properties were also observed 
in real networks. Moreover, we also gave closed form expressions (parametric forms) 
for values of these statistical network properties, allowing us to calculate a property 
(e.g., diameter, eigenvalue spectrum) of a network directly from just the initiator 
matrix. So in principle, one could invert these equations and directly get from a 
property (e.g., shape of degree distribution) to the values of initiator matrix. 

However, in previous sections, we did not say anything about how various 
network properties of a Kronecker graph correlate and interdepend. For example, it 
could be the case that two network properties are mutually exclusive. For instance, 
perhaps one could only match the network diameter but not the degree distribution 
or vice versa. However, as we show later, this is not the case. 

Now we turn our attention to automatically estimating the Kronecker initiator
graph. The setting is that we are given a real network $G$ and would like to find a
stochastic Kronecker initiator $\mathcal{P}_1$ that produces a synthetic Kronecker graph $K$ that
is "similar" to $G$. One way to measure similarity is to compare statistical network
properties, such as diameter and degree distribution, of graphs $G$ and $K$.

Figure 9.10. Largest weakly connected component.

Fraction of nodes in the largest weakly connected component ($N_c/N$)
and the effective diameter for four-star initiator graph. Top: largest
component size; we fix $\beta = 0.15$ and vary $\alpha$. Middle: largest component
size; we vary both $\alpha$ and $\beta$. Bottom: effective diameter of the network.
If the network is disconnected or very dense, path lengths are short;
the diameter is large when the network is barely connected.

Comparing statistical properties already suggests a very direct approach to
this problem: one could first identify the set of network properties (statistics) to
match, then define a quality of fit metric and somehow optimize over it. For example,
one could use the KL divergence [Kullback & Leibler 1951] or the sum of
squared differences between the degree distribution of the real network $G$ and its
synthetic counterpart $K$. Moreover, as we are interested in matching several such
statistics between the networks, one would have to meaningfully combine these
individual error metrics into a global error metric. So, one would have to specify
what kind of properties he or she cares about and then combine them accordingly.
This would be a hard task as the patterns of interest have very different magnitudes
and scales. Moreover, as new network patterns are discovered, the error functions
would have to be changed and models re-estimated. And even then it is not clear
how to define the optimization procedure to maximize the quality of fit and how to
perform optimization over the parameter space.

Our approach here is different. Instead of committing to a set of network
properties ahead of time, we try to directly match the adjacency matrices of the
real network $G$ and its synthetic counterpart $K$. The idea is that if the adjacency
matrices are similar, then the global statistical properties (statistics computed over
$K$ and $G$) will also match. Moreover, by directly working with the graph itself (and
not summary statistics), we do not commit to any particular set of network statistics
(network properties/patterns), and as new statistical properties of networks are
discovered, our models and estimated parameters will still hold.

9.5.1 Preliminaries 

Stochastic graph models induce probability distributions over graphs. A generative
model assigns a probability $P(G)$ to every graph $G$. $P(G)$ is the likelihood
that a given model (with a given set of parameters) generates the graph $G$. We
concentrate on the stochastic Kronecker graph model and consider fitting it to a
real graph $G$, our data. We use the maximum-likelihood approach, i.e., we aim to
find parameter values, the initiator $\mathcal{P}_1$, that maximize $P(G)$ under the stochastic
Kronecker graph model.

This approach presents several challenges: 

• Model selection: A graph is a single structure and not a set of items drawn 
independently and identically distributed (i.i.d.) from some distribution. So, 
one cannot split it into independent training and test sets. The fitted pa- 
rameters will thus be best to generate a particular instance of a graph. Also, 
overfitting could be an issue since a more complex model generally fits better. 

• Node correspondence: The second challenge is the node correspondence
or node labeling problem. The graph G has a set of N nodes, and each
node has a unique label (index, ID). Labels do not carry any particular meaning;
they just uniquely identify the nodes. One can think of this as the graph first
being generated and the labels (node IDs) then being randomly assigned. This
means that two isomorphic graphs that have different node labels should have
the same likelihood. A permutation $\sigma$ is sufficient to describe the node
correspondences as it maps labels (IDs) to nodes of the graph. To compute the
likelihood P(G), one has to consider all node correspondences,
$P(G) = \sum_{\sigma} P(G|\sigma) P(\sigma)$, where the sum is over all N!
permutations $\sigma$ of N nodes. Calculating this superexponential sum explicitly
is infeasible for any graph with more than a handful of nodes. Intuitively, one can
think of this summation as some kind of graph isomorphism test where we are
searching for the best correspondence (mapping) between nodes of G and $\mathcal{P}$.

• Likelihood estimation: Even if we assume one can efficiently solve the
node correspondence problem, calculating $P(G|\sigma)$ naively takes $O(N^2)$ as
one has to evaluate the probability of each of the $N^2$ possible edges in the
graph adjacency matrix. Again, for graphs of the size we want to model here,
approaches with quadratic complexity are infeasible.

To develop our solution, we use sampling to avoid the superexponential sum
over the node correspondences. By exploiting the structure of the Kronecker matrix
multiplication, we develop an algorithm to evaluate $P(G|\sigma)$ in linear time O(M),
where M is the number of edges of G. Since real graphs are sparse, i.e., the number
of edges is roughly of the same order as the number of nodes, this makes fitting of
Kronecker graphs to large networks feasible.

9.5.2 Problem formulation 

Suppose we are given a graph G on $N = N_1^k$ nodes (for some positive integer k) and
an $N_1 \times N_1$ stochastic Kronecker graph initiator matrix $\mathcal{P}_1$. Here
$\mathcal{P}_1$ is a parameter matrix, a set of parameters that we aim to estimate.
For now, also assume $N_1$, the size of the initiator matrix, is given. Later, we will
show how to automatically select it. Next, using $\mathcal{P}_1$, we create a
stochastic Kronecker graph probability matrix $\mathcal{P}_k$, where every entry
$p_{uv}$ of $\mathcal{P}_k$ contains a probability that node u links to node v. We
then evaluate the probability that G is a realization of $\mathcal{P}_k$. The task is
to find such $\mathcal{P}_1$ that has the highest probability of realizing
(generating) G.
Formally, we are solving

$$\arg\max_{\mathcal{P}_1} P(G \mid \mathcal{P}_1) \tag{9.3}$$

To keep the notation simpler, we use the standard symbol $\Theta$ to denote the
parameter matrix $\mathcal{P}_1$ that we are trying to estimate. We denote entries of
$\Theta = \mathcal{P}_1 = [\theta_{ij}]$, and similarly we denote
$\mathcal{P} = \mathcal{P}_k = [p_{ij}]$. Note that here we slightly simplified the
notation: we use $\Theta$ to refer to $\mathcal{P}_1$, and $\theta_{ij}$ are elements
of $\Theta$. Similarly, $p_{ij}$ are elements of $\mathcal{P}$ (= $\mathcal{P}_k$).
Moreover, we denote $K = R(\mathcal{P})$; i.e., K is a realization of the stochastic
Kronecker graph sampled from probabilistic adjacency matrix $\mathcal{P}$.

As noted before, because the node IDs are assigned arbitrarily and they carry
no significant information, we have to consider all the mappings of nodes from G
to rows and columns of stochastic adjacency matrix $\mathcal{P}$. A priori all
labelings are equally likely. A permutation $\sigma$ of the set {1, ..., N} defines
this mapping of nodes from G to stochastic adjacency matrix $\mathcal{P}$. To evaluate
the likelihood of G, one needs to consider all possible mappings of N nodes of G to
rows (columns) of $\mathcal{P}$.

Figure 9.11. Kronecker parameter estimation as an optimization
problem. We search over the initiator matrices $\Theta$ (= $\mathcal{P}_1$). Using
Kronecker multiplication, we create probabilistic adjacency matrix
$\Theta^{\otimes k}$ that is of the same size as real network G. Now, we evaluate the
likelihood by simultaneously traversing and multiplying entries of G and
$\Theta^{\otimes k}$ (see equation (9.5)). As shown by the figure, permutation
$\sigma$ plays an important role, as permuting rows and columns of G could make it
look more similar to $\Theta^{\otimes k}$ and thus increase the likelihood.

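For convenience, we work with log-likelihood $l(\Theta)$, and solve
$\hat{\Theta} = \arg\max_{\Theta} l(\Theta)$, where $l(\Theta)$ is defined as

$$l(\Theta) = \log P(G \mid \Theta) = \log \sum_{\sigma} P(G \mid \Theta, \sigma)\, P(\sigma \mid \Theta)
= \log \sum_{\sigma} P(G \mid \Theta, \sigma)\, P(\sigma) \tag{9.4}$$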

The likelihood that a given initiator matrix $\Theta$ and permutation $\sigma$ gave
rise to the real graph G, $P(G|\Theta,\sigma)$, is calculated naturally as follows.
First, by using $\Theta$, we create the stochastic Kronecker graph adjacency matrix
$\mathcal{P} = \mathcal{P}_k = \Theta^{\otimes k}$. Permutation $\sigma$ defines the
mapping of nodes of G to the rows and columns of stochastic adjacency matrix
$\mathcal{P}$. (See Figure 9.11 for the illustration.)

We then model edges as independent Bernoulli random variables parameterized
by the parameter matrix $\Theta$. So, each entry $p_{uv}$ of $\mathcal{P}$ gives
exactly the probability of edge (u, v) appearing.

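We then define the likelihood

$$P(G \mid \mathcal{P}, \sigma) = \prod_{(u,v) \in G} \mathcal{P}[\sigma_u, \sigma_v]
\prod_{(u,v) \notin G} \bigl(1 - \mathcal{P}[\sigma_u, \sigma_v]\bigr) \tag{9.5}$$

where we denote $\sigma_i$ as the ith element of the permutation $\sigma$, and
$\mathcal{P}[i,j]$ is the element at row i and column j of matrix
$\mathcal{P} = \Theta^{\otimes k}$.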

The likelihood is defined very naturally. We traverse the entries of adjacency
matrix G and then, on the basis of whether a particular edge appeared in G or not,
we take the probability of the edge occurring (or not) as given by $\mathcal{P}$ and
multiply these probabilities. As one has to touch all the entries of the stochastic
adjacency matrix $\mathcal{P}$, evaluating equation (9.5) takes $O(N^2)$ time.
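
The quadratic cost of equation (9.5) is easy to see in a small sketch. The Python below is illustrative only: the function names `kron_power` and `naive_log_likelihood` are assumptions rather than names from the text, nodes are indexed from 0, and the dense matrix is built explicitly, which is feasible only for tiny graphs.

```python
import math
from itertools import product

def kron_power(theta, k):
    """Return the k-th Kronecker power of a square matrix (list of lists)."""
    result = [[1.0]]
    m = len(theta)
    for _ in range(k):
        size = len(result) * m
        # (A kron B)[r][c] = A[r // m][c // m] * B[r % m][c % m]
        result = [[result[r // m][c // m] * theta[r % m][c % m]
                   for c in range(size)]
                  for r in range(size)]
    return result

def naive_log_likelihood(adj, theta, k, sigma):
    """Logarithm of equation (9.5); visits all N^2 ordered pairs (u, v)."""
    p = kron_power(theta, k)
    n = len(adj)
    total = 0.0
    for u, v in product(range(n), repeat=2):
        puv = p[sigma[u]][sigma[v]]
        # Present edges contribute log p, absent edges contribute log(1 - p).
        total += math.log(puv) if adj[u][v] else math.log(1.0 - puv)
    return total
```

Working in log space avoids underflow when many small probabilities are multiplied; the loop over all pairs is what the later sections set out to avoid.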

We further illustrate the process of estimating stochastic Kronecker initiator
matrix $\Theta$ in Figure 9.11. We search over initiator matrices $\Theta$ to find the
one that maximizes the likelihood $P(G|\Theta)$. To estimate $P(G|\Theta)$, we are
given a concrete $\Theta$, and now we use Kronecker multiplication to create
probabilistic adjacency matrix $\Theta^{\otimes k}$ that is of the same size as real
network G. Now, we evaluate the likelihood by traversing the corresponding entries
of G and $\Theta^{\otimes k}$. Equation (9.5) basically traverses the adjacency matrix
of G and maps every entry (u, v) of G to a corresponding entry
$(\sigma_u, \sigma_v)$ of $\mathcal{P}$. Then, in case that edge (u, v) exists in G
(i.e., G[u, v] = 1), the likelihood of that particular edge existing is
$\mathcal{P}[\sigma_u, \sigma_v]$, and similarly, in case the edge (u, v) does not
exist, the likelihood is simply $1 - \mathcal{P}[\sigma_u, \sigma_v]$. This also
demonstrates the importance of permutation $\sigma$, as permuting rows and columns
of G could make the adjacency matrix look more "similar" to $\Theta^{\otimes k}$ and
would increase the likelihood.

So far, we showed how to assess the quality (likelihood) of a particular $\Theta$.
So, naively one could perform some kind of grid search to find the best $\Theta$.
However, this is very inefficient. A better way is to compute the gradient of the
log-likelihood $\frac{\partial}{\partial\Theta} l(\Theta)$ and then use the gradient
to update the current estimate of $\Theta$ and move towards a solution of higher
likelihood. Algorithm 9.1 gives an outline of the optimization procedure.

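Algorithm 9.1. Kronecker fitting.

Input consists of size of parameter matrix $N_1$, graph G on $N = N_1^k$ nodes, and
learning rate $\lambda$. Output consists of Maximum Likelihood Estimation (MLE)
parameters $\hat{\Theta}$ ($N_1 \times N_1$ probability matrix).

$\hat{\Theta}$ = KronFit($N_1$, G, N, $\lambda$)

1 initialize $\hat{\Theta}_1$

2 while not converged

3 do evaluate gradient: $\frac{\partial}{\partial\hat{\Theta}_t} l(\hat{\Theta}_t)$

4 update parameter estimates: $\hat{\Theta}_{t+1} = \hat{\Theta}_t + \lambda \frac{\partial}{\partial\hat{\Theta}_t} l(\hat{\Theta}_t)$

5 return $\hat{\Theta} = \hat{\Theta}_t$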

However, there are several difficulties with this algorithm. First, we are assuming
gradient-descent-type optimization will find a good solution, i.e., the problem
does not have (too many) local minima. Second, we are summing over exponentially
many permutations in equation (9.4). Third, the evaluation of equation (9.5) as it
is written now takes $O(N^2)$ time and needs to be evaluated N! times. So, given a
concrete $\Theta$, just naively calculating the likelihood takes $O(N!\,N^2)$ time,
and then one also has to optimize over $\Theta$.

Observation 2. The complexity of naively calculating the likelihood $P(G|\Theta)$ of
the graph G is $O(N!\,N^2)$, where N is the number of nodes in G.

Next, we show that all this can be done in linear time.



Downloaded 09 Dec 201 1 to 1 29.1 74.55.245. Redistribution subject to SIAIVI license or copyright; see http://www.siam.org/journals/ojsa.php 






9.5.3 Summing over the node labelings 

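To maximize equation (9.3) using Algorithm 9.1, we need to obtain the gradient of
the log-likelihood $\frac{\partial}{\partial\Theta} l(\Theta)$. We can write

$$\frac{\partial}{\partial\Theta} l(\Theta)
= \frac{\sum_{\sigma} \frac{\partial}{\partial\Theta} P(G \mid \sigma, \Theta)\, P(\sigma)}
       {\sum_{\sigma'} P(G \mid \sigma', \Theta)\, P(\sigma')}
= \sum_{\sigma} \frac{\partial \log P(G \mid \sigma, \Theta)}{\partial\Theta}\,
  \frac{P(G \mid \sigma, \Theta)\, P(\sigma)}{P(G \mid \Theta)}
= \sum_{\sigma} \frac{\partial \log P(G \mid \sigma, \Theta)}{\partial\Theta}\,
  P(\sigma \mid G, \Theta) \tag{9.6}$$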



Note that we are still summing over all N! permutations $\sigma$, so calculating
equation (9.6) is computationally intractable for graphs with more than a handful
of nodes. However, the equation has a nice form that allows for use of simulation
techniques to avoid the summation over superexponentially many node
correspondences. Thus, we simulate draws from the permutation distribution
$P(\sigma|G, \Theta)$, and then evaluate the quantities at the sampled permutations
to obtain the expected values of log-likelihood and gradient. Algorithm 9.2 gives
the details.

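Algorithm 9.2. Calculating log-likelihood and gradient.

Input consists of parameter matrix $\Theta$ and graph G. Output consists of
log-likelihood $l(\Theta)$ and gradient $\frac{\partial}{\partial\Theta} l(\Theta)$.

($l(\Theta)$, $\frac{\partial}{\partial\Theta} l(\Theta)$) = KronCalcGrad($\Theta$, G)

1 for t = 1 to T

2 do $\sigma^{(t)}$ = SamplePermutation(G, $\Theta$)

3 $l_t = \log P(G \mid \sigma^{(t)}, \Theta)$

4 $\mathrm{grad}_t = \frac{\partial}{\partial\Theta} \log P(G \mid \sigma^{(t)}, \Theta)$

5 return $l(\Theta) = \frac{1}{T} \sum_t l_t$, and $\frac{\partial}{\partial\Theta} l(\Theta) = \frac{1}{T} \sum_t \mathrm{grad}_t$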

Note that we can also permute the rows and columns of the parameter matrix
$\Theta$ to obtain equivalent estimates. Therefore, $\Theta$ is not strictly
identifiable because of these permutations. Since the space of permutations on N
nodes is very large (grows as N!), the permutation sampling algorithm will explore
only a small fraction of the space of all permutations and may converge to one of
the global maxima (but may not explore all $N_1!$ of them) of the parameter space. As
we empirically show later, our results are not sensitive to this and multiple
restarts result in equivalent (but often permuted) parameter estimates.

Sampling permutations 

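Next, we describe the Metropolis algorithm to simulate draws from the permutation
distribution $P(\sigma|G, \Theta)$, which is given by

$$P(\sigma \mid G, \Theta) = \frac{P(\sigma, G, \Theta)}{\sum_{\tau} P(\tau, G, \Theta)}
= \frac{P(\sigma, G, \Theta)}{Z}$$

where Z is the normalizing constant that is hard to compute since it involves the sum
over N! elements. However, if we compute the likelihood ratio between permutations
$\sigma$ and $\sigma'$ (equation (9.7)), the normalizing constants nicely cancel out.
Since all permutations are a priori equally likely, the ratio follows directly from
equation (9.5):

$$\frac{P(\sigma' \mid G, \Theta)}{P(\sigma \mid G, \Theta)}
= \prod_{(u,v) \in G} \frac{\mathcal{P}[\sigma'_u, \sigma'_v]}{\mathcal{P}[\sigma_u, \sigma_v]}
  \prod_{(u,v) \notin G} \frac{1 - \mathcal{P}[\sigma'_u, \sigma'_v]}{1 - \mathcal{P}[\sigma_u, \sigma_v]} \tag{9.7}$$

$$= \prod_{\substack{(u,v) \in G \\ (\sigma_u,\sigma_v) \neq (\sigma'_u,\sigma'_v)}}
    \frac{\mathcal{P}[\sigma'_u, \sigma'_v]}{\mathcal{P}[\sigma_u, \sigma_v]}
  \prod_{\substack{(u,v) \notin G \\ (\sigma_u,\sigma_v) \neq (\sigma'_u,\sigma'_v)}}
    \frac{1 - \mathcal{P}[\sigma'_u, \sigma'_v]}{1 - \mathcal{P}[\sigma_u, \sigma_v]} \tag{9.8}$$

The above formula suggests the use of a Metropolis sampling algorithm (see
[Gamerman 1997]) to simulate draws from the permutation distribution since
Metropolis is solely based on such ratios (where normalizing constants cancel out).
In particular, suppose that in the Metropolis algorithm (Algorithm 9.3) we consider
a move from permutation $\sigma$ to a new permutation $\sigma'$. The probability of
accepting the move to $\sigma'$ is given by equation (9.7) (if
$\frac{P(\sigma'|G,\Theta)}{P(\sigma|G,\Theta)} \le 1$) or 1 otherwise.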

Algorithm 9.3. Sample permutation.

Metropolis sampling of the node permutation. Input consists of the Kronecker
initiator matrix $\Theta$ and a graph G on N nodes. Output consists of permutation
$\sigma^{(i)} \sim P(\sigma|G,\Theta)$. U(0,1) is a uniform distribution on [0,1], and
$\sigma'$ := SwapNodes($\sigma$, j, k) is the permutation $\sigma'$ obtained from
$\sigma$ by swapping elements at positions j and k.

$\sigma^{(i)}$ = SamplePermutation(G, $\Theta$)

1 $\sigma^{(0)}$ = (1, ..., N)

2 i = 1

3 repeat

4 Draw j and k uniformly from (1, ..., N)

5 $\sigma^{(i)}$ = SwapNodes($\sigma^{(i-1)}$, j, k)

6 Draw u from U(0, 1)

7 if $u > \frac{P(\sigma^{(i)}|G,\Theta)}{P(\sigma^{(i-1)}|G,\Theta)}$

8 then $\sigma^{(i)} = \sigma^{(i-1)}$

9 i = i + 1

10 until $\sigma^{(i)} \sim P(\sigma|G, \Theta)$

11 return $\sigma^{(i)}$
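
A minimal Python sketch of Algorithm 9.3 may help make the accept/reject step concrete. It makes several assumptions that are not in the text: nodes are indexed from 0, the caller supplies a hypothetical `log_lik(sigma)` function returning $\log P(G|\sigma,\Theta)$, and a fixed number of `steps` stands in for the convergence test on line 10.

```python
import math
import random

def sample_permutation(n, log_lik, steps, rng=None):
    """Metropolis chain over permutations of range(n) using SwapNodes proposals.

    log_lik(sigma) is assumed to return a finite log P(G | sigma, Theta).
    """
    rng = rng or random.Random()
    sigma = list(range(n))
    current = log_lik(sigma)
    for _ in range(steps):
        j, k = rng.randrange(n), rng.randrange(n)
        sigma[j], sigma[k] = sigma[k], sigma[j]       # propose the swap
        proposed = log_lik(sigma)
        # Accept with probability min(1, ratio), with the ratio taken in log space.
        u = rng.random()
        if proposed >= current or u < math.exp(proposed - current):
            current = proposed
        else:
            sigma[j], sigma[k] = sigma[k], sigma[j]   # reject: undo the swap
    return sigma
```

Only the difference of log-likelihoods enters the test, which mirrors the cancellation of the normalizing constant Z noted above.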

Now we have to devise a way to sample permutations $\sigma$ from the proposal
distribution. One way to do this would be to simply generate a random permutation
$\sigma'$ and then check the acceptance condition. This approach would be very
inefficient as we expect the distribution $P(\sigma|G,\Theta)$ to be heavily skewed;
i.e., there will be a relatively small number of good permutations (node mappings).
Even more so as the degree distributions in real networks are skewed, there will be
many bad permutations with low likelihood and few good ones that do a good job in
matching nodes of high degree.

To make the sampling process "smoother," i.e., sample permutations that
are not that different (and thus are not randomly jumping across the permutation
space), we design a Markov chain. The idea is to stay in the high-likelihood
part of the permutation space longer. We do this by making samples dependent;
i.e., given $\sigma'$, we want to generate next candidate permutation $\sigma''$ to
then evaluate the likelihood ratio. When designing the Markov chain step, one has to
be careful so that the proposal distribution satisfies the detailed balance
condition: $\pi(\sigma') P(\sigma'|\sigma'') = \pi(\sigma'') P(\sigma''|\sigma')$,
where $P(\sigma'|\sigma'')$ is the transition probability of obtaining permutation
$\sigma'$ from $\sigma''$, and $\pi(\sigma')$ is the stationary distribution.

In Algorithm 9.3, we use a simple proposal where, given permutation $\sigma'$, we
generate $\sigma''$ by swapping elements at two positions of $\sigma'$ chosen
uniformly at random. We refer to this proposal as SwapNodes. While this is simple and
clearly satisfies the detailed balance condition, it is also inefficient in that most
of the time low degree nodes will get swapped (a direct consequence of heavy-tailed
degree distributions). This has two consequences: (a) we will slowly converge to good
permutations (accurate mappings of high degree nodes), and (b) once we reach
a good permutation, very few permutations will get accepted as most proposed
permutations $\sigma'$ will swap low degree nodes (as they form the majority of nodes).

A possibly more efficient way would be to swap elements of $\sigma$ based on
corresponding node degree, so that high degree nodes would get swapped more often.
However, doing this directly does not satisfy the detailed balance condition. A way 
of sampling labels biased by node degrees that at the same time satisfies the detailed 
balance condition is the following: we pick an edge in G uniformly at random and 
swap the labels of the nodes at the edge endpoints. Notice this is biased towards 
swapping labels of nodes with high degrees simply as they have more edges. The 
detailed balance condition holds as edges are sampled uniformly at random. We 
refer to this proposal as SwapEdgeEndpoints. 

However, the issue with this proposal is that if the graph G is disconnected, 
we will only be swapping labels of nodes that belong to the same connected compo- 
nent. This means that some parts of the permutation space will never get visited. 
To overcome this problem, we execute SwapNodes with some probability $\omega$ and
SwapEdgeEndpoints with probability $1 - \omega$.

To summarize, we consider the following two permutation proposal distributions:

• $\sigma''$ = SwapNodes($\sigma'$): we obtain $\sigma''$ by taking $\sigma'$,
uniformly at random selecting a pair of elements and swapping their positions.

• $\sigma''$ = SwapEdgeEndpoints($\sigma'$): we obtain $\sigma''$ from $\sigma'$ by
first sampling an edge (j, k) from G uniformly at random, then we take $\sigma'$ and
swap the labels at positions j and k.
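
The two proposals and their mixture can be sketched in a few lines of Python. The function names, the 0-based indexing, the representation of G as a list of `(j, k)` edge tuples, and the parameter name `omega` are assumptions made for illustration.

```python
import random

def swap_nodes(sigma, rng):
    """SwapNodes: swap the entries at two positions chosen uniformly at random."""
    j, k = rng.randrange(len(sigma)), rng.randrange(len(sigma))
    new = list(sigma)
    new[j], new[k] = new[k], new[j]
    return new

def swap_edge_endpoints(sigma, edges, rng):
    """SwapEdgeEndpoints: pick an edge (j, k) uniformly, swap positions j and k."""
    j, k = rng.choice(edges)
    new = list(sigma)
    new[j], new[k] = new[k], new[j]
    return new

def propose(sigma, edges, omega, rng):
    """Use SwapNodes with probability omega, otherwise SwapEdgeEndpoints."""
    if rng.random() < omega:
        return swap_nodes(sigma, rng)
    return swap_edge_endpoints(sigma, edges, rng)
```

Because a high degree node appears in many edges, `swap_edge_endpoints` selects it more often, while the `omega` branch keeps nodes in different connected components reachable.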

Speeding up the likelihood ratio calculation 

We further speed up the algorithm by using the following observation. As written,
equation (9.7) takes $O(N^2)$ to evaluate since we have to consider $N^2$ possible
edges. However, notice that permutations $\sigma$ and $\sigma'$ differ only at two
positions, i.e., elements at positions j and k are swapped, so $\sigma$ and $\sigma'$
map all nodes except the two to the same locations. This means those elements of
equation (9.7) cancel out. Thus to update the likelihood, we only need to traverse
two rows and columns of matrix $\mathcal{P}$, namely rows and columns j and k, since
everywhere else the mapping of the nodes to the adjacency matrix is the same for both
permutations. This results in equation (9.8), where the products now range only over
the two rows/columns of $\mathcal{P}$ where $\sigma$ and $\sigma'$ differ.
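
The cancellation can be illustrated with a short sketch that visits only the node pairs involving j or k. It is a simplification under stated assumptions: the probability matrix `p` and the adjacency matrix `adj` are passed as dense lists of lists for clarity (the text goes on to explain why the real matrix is never stored), indices start at 0, and the function names are hypothetical.

```python
import math

def pair_log_prob(adj, p, sigma, u, v):
    """Log-probability contribution of the ordered pair (u, v) under sigma."""
    puv = p[sigma[u]][sigma[v]]
    return math.log(puv) if adj[u][v] else math.log(1.0 - puv)

def swap_log_ratio(adj, p, sigma, j, k):
    """log P(sigma'|G,Theta) - log P(sigma|G,Theta), sigma' = sigma with j, k swapped.

    Only pairs with u or v in {j, k} can change, so O(N) pairs are visited.
    """
    n = len(adj)
    new = list(sigma)
    new[j], new[k] = new[k], new[j]
    touched = set()
    for w in range(n):
        for a in (j, k):
            touched.add((a, w))
            touched.add((w, a))
    return sum(pair_log_prob(adj, p, new, u, v) - pair_log_prob(adj, p, sigma, u, v)
               for u, v in touched)
```

Collecting the affected pairs in a set avoids counting pairs such as (j, k) twice.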

Graphs we are working with here are too large to allow us to explicitly create
and store the stochastic adjacency matrix $\mathcal{P}$ by Kronecker-powering the
initiator matrix $\Theta$. Every time probability $\mathcal{P}[i,j]$ of edge (i, j)
is needed, equation (9.2) is evaluated, which takes O(k). So a single iteration of
Algorithm 9.3 takes O(kN).
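
Equation (9.2) is not reproduced in this section; the sketch below assumes it is the usual digit-wise product, in which the base-$N_1$ digits of the row and column indices select one entry of $\Theta$ per Kronecker level. The name `kron_entry` and the 0-based indexing are illustrative choices.

```python
def kron_entry(theta, k, u, v):
    """Entry (u, v) of the k-th Kronecker power of theta, 0-based, in O(k) time."""
    n1 = len(theta)
    prob = 1.0
    for _ in range(k):
        # One factor per level: the current base-n1 digits of u and v.
        prob *= theta[u % n1][v % n1]
        u //= n1
        v //= n1
    return prob
```

For $\Theta$ = [0.8, 0.6; 0.5, 0.3] and k = 2, `kron_entry(theta, 2, 1, 2)` multiplies $\theta_{21}$ and $\theta_{12}$ (in the 1-based notation of the text), giving 0.5 · 0.6 = 0.3, without ever forming the 4 × 4 matrix.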

Observation 3. Sampling a permutation $\sigma$ from $P(\sigma|G,\Theta)$ takes O(kN).

So far, we have shown how to obtain a permutation, but we still need to 
evaluate the likelihood and find the gradients that will guide us in finding a good 
initiator matrix. Naively evaluating the network likelihood (gradient) as written in 
equation (9.6) takes $O(N^2)$ time.



9.5.4 Efficiently approximating likelihood and gradient 

We just showed how to efficiently sample node permutations. Now, given a
permutation, we show how to efficiently evaluate the likelihood and its gradient.
Similar to evaluating the likelihood ratio, naively calculating the log-likelihood
$l(\Theta)$ or its gradient $\frac{\partial}{\partial\Theta} l(\Theta)$ takes time
quadratic in the number of nodes. Next, we show how to compute this in linear time
O(M).

We begin with the observation that real graphs are sparse, that is, the number
of edges is not quadratic but rather almost linear in the number of nodes, $M \ll N^2$.
This means that the majority of entries of the graph adjacency matrix are zero, i.e., 
most of the edges are not present. We exploit this fact. The idea is to first calculate 
the likelihood (gradient) of an empty graph, i.e., a graph with zero edges, and then 
correct for the edges that actually appear in G. 

To naively calculate the likelihood for an empty graph, one needs to evaluate
every cell of the graph adjacency matrix. We consider Taylor approximation to the
likelihood, and exploit the structure of matrix $\mathcal{P}$ to devise a
constant-time algorithm.

First, consider the second order Taylor approximation to the log-likelihood of
an edge that succeeds with probability x but does not appear in the graph

$$\log(1 - x) \approx -x - \tfrac{1}{2} x^2$$

Calculating $l_e(\Theta)$, the log-likelihood of an empty graph, becomes

$$l_e(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{N} \log(1 - p_{ij})
\approx -\Bigl(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} \theta_{ij}\Bigr)^{k}
 - \frac{1}{2} \Bigl(\sum_{i=1}^{N_1} \sum_{j=1}^{N_1} \theta_{ij}^2\Bigr)^{k} \tag{9.9}$$

Notice that while the first pair of sums ranges over N elements, the last
pair only ranges over $N_1$ elements ($N_1 = N^{1/k}$). Equation (9.9) holds due to
the recursive structure of matrix $\mathcal{P}$ generated by the Kronecker product.
We substitute the $\log(1 - p_{ij})$ with its Taylor approximation, which gives a sum
over elements of $\mathcal{P}$ and their squares. Next, we notice that the sum of
elements of $\mathcal{P}$ forms a multinomial series, and thus
$\sum_{i,j} p_{ij} = (\sum_{i,j} \theta_{ij})^k$, where $\theta_{ij}$ denotes an
element of $\Theta$, and $p_{ij}$ an element of $\Theta^{\otimes k}$.
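
A small numerical check of equation (9.9) can be written as follows. The sketch compares the exact empty-graph log-likelihood, computed over all $N^2$ entries, with the second-order approximation computed from the $N_1^2$ initiator entries alone. The function names are hypothetical, and the exact version is practical only for small k.

```python
import math

def empty_graph_loglik_exact(theta, k):
    """Sum of log(1 - p_ij) over all N^2 entries of the k-th Kronecker power."""
    n1 = len(theta)
    n = n1 ** k
    total = 0.0
    for u in range(n):
        for v in range(n):
            p, a, b = 1.0, u, v
            for _ in range(k):            # digit-wise product gives p_uv
                p *= theta[a % n1][b % n1]
                a //= n1
                b //= n1
            total += math.log(1.0 - p)
    return total

def empty_graph_loglik_taylor(theta, k):
    """Second-order approximation of equation (9.9), using only N1^2 entries."""
    s1 = sum(x for row in theta for x in row)
    s2 = sum(x * x for row in theta for x in row)
    return -(s1 ** k) - 0.5 * (s2 ** k)
```

The approximate version does a constant amount of work per initiator entry regardless of N, which is the source of the constant-time claim above.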

Calculating the log-likelihood of G now takes O(M). First, we approximate
the likelihood of an empty graph in constant time and then account for the edges
that are actually present in G; i.e., we subtract the "no-edge" likelihood and add
the "edge" likelihoods

$$l(\Theta) = l_e(\Theta) + \sum_{(u,v) \in G}
\Bigl( -\log\bigl(1 - \mathcal{P}[\sigma_u, \sigma_v]\bigr) + \log\bigl(\mathcal{P}[\sigma_u, \sigma_v]\bigr) \Bigr)$$
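
Putting the two pieces together gives a sketch of the O(M) evaluation: a constant-time approximation for the empty graph followed by one correction per edge. As before, this is an illustration under assumptions not stated in the text: nodes are 0-based, G is a list of `(u, v)` edge tuples, each entry of $\mathcal{P}$ is computed on demand by a digit-wise product, and the function names are hypothetical.

```python
import math

def kron_entry(theta, k, u, v):
    """Entry (u, v) of the k-th Kronecker power of theta in O(k) time."""
    n1 = len(theta)
    prob = 1.0
    for _ in range(k):
        prob *= theta[u % n1][v % n1]
        u //= n1
        v //= n1
    return prob

def approx_log_likelihood(edges, theta, k, sigma):
    """Approximate log P(G | Theta, sigma) in O(k * M) time."""
    s1 = sum(x for row in theta for x in row)
    s2 = sum(x * x for row in theta for x in row)
    loglik = -(s1 ** k) - 0.5 * (s2 ** k)      # empty graph, equation (9.9)
    for u, v in edges:
        p = kron_entry(theta, k, sigma[u], sigma[v])
        # Remove the "no-edge" term and add the "edge" term for this pair.
        loglik += math.log(p) - math.log(1.0 - p)
    return loglik
```

The result differs from the exact log-likelihood only by the Taylor truncation error of the empty-graph term, which is discussed next.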

We note that by using the second order Taylor approximation to the log-likelihood
of an empty graph, the error term of the approximation is
$\frac{1}{3} (\sum_{i,j} \theta_{ij}^3)^k$, which can diverge for large k. For typical
values of initiator matrix $\mathcal{P}_1$ (that we present in Section 9.6.5), we note
that one needs about a fourth- or fifth-order Taylor approximation for the error of
the approximation to actually go to zero as k approaches infinity, i.e.,
$\sum_{i,j} \theta_{ij}^{n+1} < 1$, where n is the order of Taylor approximation
employed.

9.5.5 Calculating the gradient 

Calculation of the gradient of the log-likelihood follows exactly the same pattern 
as described above. First, by using the Taylor approximation, we calculate the 
gradient as if graph G would have no edges. Then, we correct the gradient for the 
edges that are present in G. As in the previous section, we speed up the calculations
of the gradient by exploiting the fact that two consecutive permutations $\sigma$ and
$\sigma'$ differ only at two positions, and thus given the gradient from the previous
step, one only needs to account for the swap of the two rows and columns of the
gradient matrix $\partial\mathcal{P}/\partial\Theta$ to update the gradients of
individual parameters.

9.5.6 Determining the size of an initiator matrix 

The question we answer next is how to determine the right number of parameters,
i.e., what is the right size of matrix $\Theta$? This is a classical question of model
selection in which there is a tradeoff between the complexity of the model and the
quality of the fit. A bigger model with more parameters usually fits better; however,
it is also more likely to overfit the data.

For model selection to find the appropriate value of $N_1$, the size of matrix
$\Theta$, and to choose the right tradeoff between the complexity of the model and
the quality of the fit, we propose to use the Bayes Information Criterion (BIC)
[Schwarz 1978]. The stochastic Kronecker graph models the presence of edges with
independent Bernoulli random variables, where the canonical number of parameters is
$N_1^{2k}$, which is a function of a lower-dimensional parameter $\Theta$. This is
then a curved exponential family [Efron 1975], and BIC naturally applies

$$\mathrm{BIC}(N_1) = -l(\hat{\Theta}_{N_1}) + \tfrac{1}{2} N_1^2 \log(N^2)$$

where $\hat{\Theta}_{N_1}$ are the maximum-likelihood parameters of the model with
$N_1 \times N_1$ parameter matrix, and N is the number of nodes in G. Note that one
could also add an additional term to the above formula to account for multiple global
maxima of the likelihood space, but as $N_1$ is small, the additional term would make
no real difference.

As an alternative to BIC, one could also consider the Minimum Description
Length (MDL) principle [Rissanen 1978] in which the model is scored by the quality
of the fit plus the size of the description that encodes the model and the parameters.

9.6 Experiments on real and synthetic data 

Next, we describe our experiments on a range of real and synthetic networks. We 
divide the experiments into several subsections. First, we examine the convergence 
and mixing of the Markov chain of our permutation sampling scheme. Then, we 
consider estimating the parameters of synthetic Kronecker graphs to see whether 
KronFit is able to recover the parameters used to generate the network. Last, we 
consider fitting stochastic Kronecker graphs to large real-world networks. 

9.6.1 Permutation sampling 

In our experiments, we considered both synthetic and real graphs. Unless mentioned
otherwise, all synthetic Kronecker graphs were generated using
$\mathcal{P}_1^*$ = [0.8, 0.6; 0.5, 0.3], and k = 14, which gives us a graph G on
N = 16,384 nodes and M = 115,741 edges. We chose this particular $\mathcal{P}_1^*$ as
it resembles the typical initiator for real networks analyzed later in this section.

Convergence of the log-likelihood and the gradient 

First, we examine the convergence of Metropolis permutation sampling, where per- 
mutations are sampled sequentially. A new permutation is obtained by modifying 
the previous one, which creates a Markov chain. We want to assess the convergence 
and mixing of the chain. We aim to determine how many permutations one needs to 
draw to reliably estimate the likelihood and the gradient, and also how long it takes 
until the samples converge to the stationary distribution. For the experiment, we 
generated a synthetic stochastic Kronecker graph using $\mathcal{P}_1^*$ as defined above. Then,
starting with a random permutation, we ran Algorithm 9.3 and measured how the 
likelihood and the gradients converge to their true values. 

In this particular case, we first generated a stochastic Kronecker graph G as
described above, but then calculated the likelihood and the parameter gradients for
$\Theta'$ = [0.8, 0.75; 0.45, 0.3]. We averaged the likelihoods and gradients over
buckets of 1000 consecutive samples and plotted how the log-likelihood, calculated
over the sampled permutations, approached the true log-likelihood (that we can
compute since G is a stochastic Kronecker graph).

Figure 9.12. Convergence of the log-likelihood.
Convergence of the log-likelihood and components of the gradient toward their true
values for Metropolis permutation sampling (Algorithm 9.3) with the number of
samples.

First, we present experiments that aim to answer how many samples (i.e.,
permutations) one needs to draw to obtain a reliable estimate of the gradient
(see equation (9.6)). Figure 9.12(a) shows how the estimated log-likelihood
approaches the true likelihood. Notice that estimated values quickly converge to
their true values, i.e., Metropolis sampling quickly moves towards "good"
permutations. Similarly, Figure 9.12(b) plots the convergence of the gradients.
Notice that $\theta_{11}$ and $\theta_{22}$ of $\Theta'$ and $\mathcal{P}_1^*$ match,
so gradients of these two parameters should converge to zero and indeed they do. On
the other hand, $\theta_{12}$ and $\theta_{21}$ differ between $\Theta'$ and
$\mathcal{P}_1^*$. Notice that the gradient for one is positive as the parameter
$\theta_{12}$ of $\Theta'$ should be decreased, and similarly for $\theta_{21}$, the
gradient is negative as the parameter value should be increased to match the
$\Theta'$. In summary, this shows that log-likelihood and gradients rather quickly
converge to their true values.

In Figures 9.12(c) and (d), we also investigate the properties of the Markov
chain Monte Carlo sampling procedure and assess convergence and mixing criteria.
First, we plot the fraction of accepted proposals. It stabilizes at around 15%,
which is quite close to the rule of thumb of 25%. Second, Figure 9.12(d) plots the
autocorrelation of the log-likelihood as a function of the lag. Autocorrelation
$r_k$ of a signal X is a function of the lag k, where $r_k$ is defined as the
correlation of signal X at time t with X at time t + k, i.e., correlation of the
signal with itself at lag k. High autocorrelations within chains indicate slow mixing
and, usually, slow convergence. On the other hand, fast decay of autocorrelation
implies better mixing, and thus one needs fewer samples to accurately estimate the
gradient or the likelihood. Notice the rather fast autocorrelation decay.
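
For readers who want to reproduce such a diagnostic, the sample autocorrelation of a sequence of log-likelihood values can be computed as below. The estimator shown (lagged covariance divided by the total sum of squares) is the common textbook choice and is an assumption here; the text does not say which estimator was used for Figure 9.12(d).

```python
def autocorrelation(x, lag):
    """Sample autocorrelation r_k of the sequence x at lag k."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    # Covariance between the signal and a copy of itself shifted by `lag`.
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var
```

Applied to the log-likelihoods of successive sampled permutations, values that fall quickly toward zero as the lag grows indicate a well-mixing chain.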

All in all, these experiments show that one needs to sample on the order of 
tens of thousands of permutations for the estimates to converge. We also verified 
that the variance of the estimates is sufficiently small. In our experiments, we start 
with a random permutation and use long burn-in time. Then, when performing 
optimization, we use the permutation from the previous step to initialize the
permutation at the current step of the gradient descent. Intuitively, small changes
in parameter space $\Theta$ also mean small changes in $P(\sigma|G, \Theta)$.

Different proposal distributions 

In Section 9.5.3, we defined two permutation sampling strategies: SwapNodes, where 
we pick two nodes uniformly at random and swap their labels (node IDs), and 
SwapEdgeEndpoints, where we pick a random edge in a graph and then swap the 
labels of the edge endpoints. We also discussed that one can interpolate between the
two strategies by executing SwapNodes with probability $\omega$ and SwapEdgeEndpoints
with probability $1 - \omega$.

So, given a stochastic Kronecker graph G on N = 16,384 nodes and M = 115,741 edges
generated from $\mathcal{P}_1^*$ = [0.8, 0.7; 0.5, 0.3], we evaluate the likelihood of
$\Theta'$ = [0.8, 0.75; 0.45, 0.3]. As we sample permutations, we observe how the
estimated likelihood converges to the true likelihood. Moreover, we also vary
parameter $\omega$, which interpolates between the two permutation proposal
distributions. The quicker the convergence toward the true log-likelihood, the better
the proposal distribution.

Figure 9.13 plots the convergence of the log-likelihood with the number of
sampled permutations. We plot the average over nonoverlapping buckets of 1000
consecutive permutations. Faster convergence implies better permutation proposal
distribution. When we use only SwapNodes ($\omega = 1$) or SwapEdgeEndpoints
($\omega = 0$), convergence is rather slow. We obtain the best convergence for
$\omega$ around 0.6.

Similarly, Figure 9.14(a) plots the autocorrelation as a function of the lag k
for different choices of $\omega$. Faster autocorrelation decay means better mixing
of the Markov chain. Again, notice that we get the best mixing for
$\omega \approx 0.6$. (Notice logarithmic y-axis.)

Last, we diagnose how long the sampling procedure must be run before the
generated samples can be considered to be drawn (approximately) from the
stationary distribution. We call this the burn-in time of the chain. There are
various procedures for assessing convergence. Here we adopt the approach of Gelman et
al. [Gelman et al. 2003], which is based on running multiple Markov chains each
from a different starting point and then comparing the variance within the chain
and between the chains. The sooner the within- and between-chain variances
become equal, the shorter the burn-in time, i.e., the sooner the samples are drawn
from the stationary distribution.

Figure 9.13. Convergence of the log-likelihood and gradients for Metropolis
permutation sampling.

Figure 9.14. Autocorrelation as a function of $\omega$.

(a) Autocorrelation plot of the log-likelihood for the different choices
of parameter $\omega$. Notice we get best mixing with $\omega \approx 0.6$. (b) The
potential scale reduction that compares the variance inside and across
independent Markov chains for different values of parameter $\omega$.



Let l be the parameter that is being simulated with J different chains, and then
let $l_j^{(k)}$ denote the kth sample of the jth chain, where j = 1, ..., J and
k = 1, ..., K. More specifically, in our case we run separate permutation sampling
chains. So, we first sample permutation $\sigma_j^{(k)}$ and then calculate the
corresponding log-likelihood $l_j^{(k)}$.

First, we compute between- and within-chain variances $\hat{\sigma}_B^2$ and
$\hat{\sigma}_W^2$, where between-chain variance is obtained by

$$\hat{\sigma}_B^2 = \frac{K}{J-1} \sum_{j=1}^{J} \bigl(\bar{l}_{\cdot j} - \bar{l}_{\cdot\cdot}\bigr)^2$$

where $\bar{l}_{\cdot j} = \frac{1}{K} \sum_{k=1}^{K} l_j^{(k)}$ and
$\bar{l}_{\cdot\cdot} = \frac{1}{J} \sum_{j=1}^{J} \bar{l}_{\cdot j}$.

Similarly, the within-chain variance is defined by

$$\hat{\sigma}_W^2 = \frac{1}{J(K-1)} \sum_{j=1}^{J} \sum_{k=1}^{K} \bigl(l_j^{(k)} - \bar{l}_{\cdot j}\bigr)^2$$

Then, the marginal posterior variance of l is calculated using

$$\hat{\sigma}^2 = \frac{K-1}{K}\, \hat{\sigma}_W^2 + \frac{1}{K}\, \hat{\sigma}_B^2$$

Finally, we estimate the potential scale reduction [Gelman et al. 2003] of l by

$$\sqrt{\hat{R}} = \sqrt{\frac{\hat{\sigma}^2}{\hat{\sigma}_W^2}}$$

Note that as the length of the chain $K \to \infty$, $\sqrt{\hat{R}}$ converges to 1
from above. The recommendation for convergence assessment from [Gelman et al. 2003]
is that the potential scale reduction is below 1.2.
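
The diagnostic can be computed directly from the per-chain log-likelihood traces. The sketch below uses the standard Gelman-Rubin definitions of between-chain variance, within-chain variance, and pooled variance; the function name and the representation of the chains as equal-length lists of floats are assumptions.

```python
import math

def potential_scale_reduction(chains):
    """sqrt(R-hat) for J chains of equal length K (a list of lists of floats)."""
    num_chains = len(chains)
    length = len(chains[0])
    chain_means = [sum(c) / length for c in chains]
    grand_mean = sum(chain_means) / num_chains
    # Between-chain variance: spread of the chain means around the grand mean.
    between = length / (num_chains - 1) * sum(
        (m - grand_mean) ** 2 for m in chain_means)
    # Within-chain variance: average spread of samples around their chain mean.
    within = sum((x - m) ** 2
                 for c, m in zip(chains, chain_means)
                 for x in c) / (num_chains * (length - 1))
    pooled = (length - 1) / length * within + between / length
    return math.sqrt(pooled / within)
```

Chains that start far apart and have not yet mixed give values well above the 1.2 threshold, while chains exploring the same region give values near 1.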

Figure 9.14(b) gives the Gelman-Rubin-Brooks plot, where we plot the potential
scale reduction $\sqrt{\hat{R}}$ over the increasing chain length K for different
choices of parameter $\omega$. Notice that the potential scale reduction quickly
decays towards 1. Similarly, as in Figure 9.14(a), the extreme values of $\omega$
give slow decay, while we obtain the fastest potential scale reduction when
$\omega \approx 0.6$.



Properties of the permutation space 

Next, we explore the properties of the permutation space. We would like to
quantify what fraction of permutations are "good" (have high likelihood) and how
quickly they are discovered. For the experiment, we took a real network G
(AS-RouteViews network) and the MLE parameters $\hat{\Theta}$ for it that we
estimated beforehand ($l(\hat{\Theta}) \approx -150{,}000$). The network G has 6474
nodes, which means the space of all permutations has $\approx 10^{22{,}000}$ elements.

First, we sampled 1 billion ($10^9$) permutations $\sigma_i$ uniformly at random,
i.e., $P(\sigma_i) = 1/(6474!)$, and for each we evaluated its log-likelihood
$l(\sigma_i|\Theta) = \log P(\Theta|G, \sigma_i)$. We ordered the permutations in
decreasing log-likelihood and plotted $l(\sigma_i|\Theta)$ versus rank. Figure 9.15(a)
gives the plot. Notice that very few random permutations are very bad (i.e., they
give low likelihood); similarly, few permutations are very good, while most of them
are somewhere in between. Notice that the best "random" permutation has
log-likelihood of $\approx -320{,}000$, which is far below the true likelihood
$l(\hat{\Theta}) \approx -150{,}000$. This suggests that only a very small fraction of
all permutations gives good node labelings.

Figure 9.15. Distribution of log-likelihood.

Distribution of log-likelihood of permutations sampled uniformly at
random ($l(\Theta|\sigma_i)$ where $\sigma_i \sim P(\sigma)$) (top), and when sampled
from $P(\sigma|\Theta,G)$ (middle). Notice the space of good permutations is rather
small, but our sampling quickly finds permutations of high likelihood.
Convergence of log-likelihood for 10 runs of gradient descent, each from
a different random starting point (bottom).

On the other hand, we also repeated the same experiment but now using
permutations sampled from the permutation distribution
$\sigma_i \sim P(\sigma|\hat{\Theta}, G)$ via our Metropolis sampling scheme.
Figure 9.15(b) gives the plot. Notice the radical difference. Now the
log-likelihood $l(\sigma_i|\hat{\Theta})$ very quickly converges to the true likelihood
of $\approx -150{,}000$. This suggests that while the number of "good" permutations
(accurate node mappings) is rather small, our sampling procedure quickly converges
to the "good" part of the permutation space, where node mappings are accurate, and
spends the most time there.

9.6.2 Properties of the optimization space 

In maximizing the likelihood, we use a stochastic approximation to the gradient. 
This adds variance to the gradient and makes efficient optimization techniques, like 
conjugate gradient, highly unstable. Thus, we use gradient descent, which is slower 
but easier to control. First, we make the following observation. 

Observation 4. Given a real graph G, finding the maximum-likelihood stochastic
Kronecker initiator matrix $\hat{\Theta}$,

$\hat{\Theta} = \arg\max_{\Theta} P(G|\Theta)$,

is a nonconvex optimization problem.

Proof. By definition, permutations of the Kronecker graph's initiator matrix $\Theta$
all have the same log-likelihood. This means that we have several global optima
that correspond to permutations of the parameter matrix $\Theta$, and between them
the log-likelihood drops. This means that the optimization problem is nonconvex. □
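
The permutation symmetry used in the proof can be checked numerically. The sketch below assumes a hypothetical 2 x 2 initiator and relies on the mixed-product property of the Kronecker product: relabeling the initiator's rows and columns relabels the nodes of the Kronecker power, so both initiators describe the same graph distribution up to node ordering.

```python
import numpy as np
from functools import reduce

def kron_power(matrix, k):
    """k-th Kronecker power of a square matrix."""
    return reduce(np.kron, [matrix] * k)

theta = np.array([[0.9, 0.6],
                  [0.5, 0.1]])          # hypothetical initiator
swap = np.array([[0, 1],
                 [1, 0]])               # permutation of the two initiator rows/columns
theta_perm = swap @ theta @ swap.T

k = 3
P = kron_power(theta, k)                # edge probabilities for theta
P_perm = kron_power(theta_perm, k)      # edge probabilities for permuted theta
node_perm = kron_power(swap, k)         # induced permutation of the 2^k nodes

# The two probability matrices differ only by a relabeling of nodes.
same_up_to_relabeling = np.allclose(P_perm, node_perm @ P @ node_perm.T)
print(same_up_to_relabeling)
```

Because the likelihood accounts for all node orderings, the two initiators are equally likely even though they are different points in parameter space.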

The above observation does not seem promising for estimating $\Theta$ using
gradient descent, as it is prone to finding local optima. To test for this behavior,
we ran the following experiment. We generated 100 synthetic Kronecker graphs on
16,384 ($2^{14}$) nodes and 1.4 million edges on average, each with a randomly chosen
$2 \times 2$ parameter matrix $\Theta^*$. For each of the 100 graphs, we ran a single
trial of gradient descent starting from a random parameter matrix $\Theta'$ and tried
to recover $\Theta^*$. In 98% of the cases, the gradient descent converged to the true
parameters. Many times the algorithm converged to a different global optimum, i.e.,
$\hat{\Theta}$ is a permuted version of the original parameter matrix $\Theta^*$.
Moreover, the median number of gradient descent iterations was only 52.

These results suggest a surprisingly nice structure of our optimization space:
it seems to behave like a convex optimization problem with many equivalent global
optima. Moreover, this experiment is also a good sanity check as it shows that,
given a Kronecker graph, we can recover and identify the parameters that were used
to generate it.

Moreover, Figure 9.15(c) plots the log-likelihood $l(\Theta_t)$ of the current
parameter estimate $\Theta_t$ over the iterations $t$ of the stochastic gradient
descent. We plot the log-likelihood for 10 different runs of gradient descent, each
time starting from a different random set of parameters $\Theta_0$. Notice that in
all runs, gradient descent always converges toward the optimum, and none of the
runs gets stuck in a local maximum.

9.6.3 Convergence of the graph properties 

We approached the problem of estimating the stochastic Kronecker initiator matrix
$\Theta$ by defining the likelihood over the individual entries of the graph adjacency
matrix. However, what we would really like is to be given a real graph G and then
generate a synthetic graph K that has properties similar to those of the real G. By
properties we mean network statistics that can be computed from the graph, e.g.,
diameter, degree distribution, clustering coefficient, etc. A priori it is not clear
that our approach, which tries to match individual entries of the graph adjacency
matrix, will also be able to reproduce these global network statistics. However, as
is shown next, it does reproduce them.

To get some understanding of the convergence of the gradient descent in terms
of the network properties, we performed the following experiment. After every step
$t$ of stochastic gradient descent, we compared the true graph G with the synthetic
Kronecker graph $K_t$ generated using the current parameter estimates $\hat{\Theta}_t$.
Figure 9.16(a) gives the convergence of log-likelihood, and Figure 9.16(b) gives
absolute error in parameter values ($\sum |\hat{\theta}_{ij} - \theta^*_{ij}|$, where
$\hat{\theta}_{ij} \in \hat{\Theta}_t$ and $\theta^*_{ij} \in \Theta^*$).
Similarly, Figure 9.16(c) plots the effective diameter, and Figure 9.16(d) gives the
largest singular value of the graph adjacency matrix of K as it converges to the
largest singular value of G.

The properties of $K_t$ quickly converge to those of G even though we are not
directly optimizing for the network properties of G. The log-likelihood increases, the
absolute error of parameters decreases, and the diameter and the largest singular
value of $K_t$ both converge to those of G. This is a nice result as it shows that,
through maximizing the likelihood, the resulting graphs become more and more similar
also in their structural properties (even though we are not directly optimizing over
them).

9.6.4 Fitting to real-world networks 

Next, we present experiments of fitting a Kronecker graph model to real-world
networks. Given a real network G, we aim to discover the most likely parameters
$\hat{\Theta}$ that ideally would generate a synthetic graph K having similar
properties to the real G. This assumes that Kronecker graphs are a good model of the
network structure, and that KronFit is able to find good parameters. In the previous
section, we showed that KronFit can efficiently recover the parameters. Now we
examine how well a Kronecker graph can model the structure of real networks.

We consider several different networks, such as a graph of connectivity among
Internet autonomous systems (As-RouteViews) with N = 6474 and M = 26,467, a
who-trusts-whom type social network from Epinions [Richardson et al. 2003]
(Epinions) with N = 75,879 and M = 508,960, and many others. The largest
network we consider for fitting is Flickr, a photo-sharing online social network
with 584,207 nodes and 3,555,115 edges.

Figure 9.16. Convergence of graph properties.

Convergence with the number of iterations of gradient descent using the
synthetic data set. We start with a random choice of parameters, and
with steps of gradient descent, the Kronecker graph better and better
matches network properties of the target graph.

For the purpose of this section, we take a real network G, find parameters
$\hat{\Theta}$ using KronFit, generate a synthetic graph K using $\hat{\Theta}$, and
then compare G and K by comparing their properties that we introduced in
Section 9.2. In all experiments, we start from a random point (random initiator
matrix) and run gradient descent for 100 steps. At each step, we estimate the
likelihood and the gradient on the basis of 510,000 sampled permutations from which
we discard the first 10,000 samples to allow the chain to burn in.

Fitting to autonomous systems network 

First, we focus on the autonomous systems (AS) network obtained from the
University of Oregon Route Views project [RouteViews 1997]. Given the AS network G,
we run KronFit to obtain parameter estimates $\hat{\Theta}$. Using $\hat{\Theta}$, we
then generate a synthetic Kronecker graph K and compare the properties of G and K.

Figure 9.17. Autonomous systems (As-RouteViews).
Overlaid patterns of the real graph and the fitted Kronecker graph.
Notice that the fitted Kronecker graph matches patterns of the real
graph while using only four parameters ($2 \times 2$ initiator matrix).



Figure 9.17 shows the properties of As-RouteViews and compares them with
the properties of a synthetic Kronecker graph generated using the fitted parameters
$\hat{\Theta}$ of size $2 \times 2$. Notice that the properties of both graphs match
really well. The estimated parameters are
$\hat{\Theta} = [0.987, 0.571; 0.571, 0.049]$.
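
A quick sanity check on these parameters uses the standard fact that the expected number of edges of a stochastic Kronecker graph is the $k$th power of the sum of the initiator entries. The sketch below assumes $k = 13$ (the smallest $k$ with $2^k \geq 6474$) and uses the rounded parameter values quoted above; the variable names are illustrative.

```python
import numpy as np

theta_hat = np.array([[0.987, 0.571],
                      [0.571, 0.049]])   # fitted 2x2 initiator quoted in the text
n_real, m_real = 6474, 26467             # As-RouteViews nodes and edges

# Smallest Kronecker power whose node count covers the real graph.
k = int(np.ceil(np.log2(n_real)))
kron_nodes = 2 ** k

# Expected edge count: every entry of the k-th Kronecker power is a product
# of k initiator entries, so the entries sum to (sum of theta)^k.
expected_edges = theta_hat.sum() ** k
print(k, kron_nodes, round(expected_edges))
```

The result, roughly 25,000 expected edges on 8192 nodes, is of the same order as the M = 26,467 edges of As-RouteViews.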

Figure 9.17(a) compares the degree distributions of the As-RouteViews network
and its synthetic Kronecker estimate. In this and all other plots, we use
exponential binning, which is a standard procedure to de-noise the data when
plotting on log-log scales. Notice a very close match in degree distribution
between the real graph and its synthetic counterpart.
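
Exponential binning groups degrees into bins whose widths grow geometrically and divides each count by the bin width. The following is one common variant, sketched under the assumption of base-2 bins starting at degree 1; the exact binning used for the figures is not specified in the text, and the function name is illustrative.

```python
import numpy as np

def exponential_binning(degrees, base=2.0):
    """Return (bin_centers, density) for positive degrees using bins
    [1, base), [base, base^2), ... with counts divided by bin width."""
    degrees = np.asarray(degrees, dtype=float)
    degrees = degrees[degrees > 0]          # zero degrees cannot be shown on log axes
    edges = [1.0]
    while edges[-1] <= degrees.max():       # grow until the last edge exceeds the maximum
        edges.append(edges[-1] * base)
    edges = np.array(edges)
    counts, _ = np.histogram(degrees, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    return centers, counts / widths

# Small hypothetical degree sequence.
centers, density = exponential_binning([1, 1, 1, 1, 2, 2, 3, 4, 8])
print(centers, density)
```

Plotting `density` against `centers` on log-log axes smooths the noisy tail that a raw per-degree histogram would show.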

Figure 9.17(b) plots the cumulative number of pairs of nodes $g(h)$ that can
be reached in $\leq h$ hops. The hop plot gives a sense of the distribution of
the shortest path lengths in the network and of the network diameter. Last,
Figures 9.17(c) and (d) plot the spectral properties of the graph adjacency matrix.
Figure 9.17(c) plots the largest singular values versus rank, and Figure 9.17(d)
plots the components of the left singular vector (the network value) versus the
rank. Again notice the good agreement with the real graph while using only four
parameters.
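
For a small graph, the hop plot can be computed exactly with one breadth-first search per node. The sketch below counts ordered pairs and includes each node's pair with itself at distance 0; both are convention choices assumed here, and the function name is illustrative. Large graphs would call for an approximate method instead.

```python
from collections import deque

def hop_plot(adj_list):
    """adj_list: dict mapping node -> iterable of neighbors.
    Returns a list g where g[h] is the number of ordered pairs (u, v)
    with shortest-path distance at most h."""
    pairs_at_distance = {}
    for source in adj_list:
        dist = {source: 0}
        queue = deque([source])
        while queue:                         # standard BFS from source
            u = queue.popleft()
            for v in adj_list[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            pairs_at_distance[d] = pairs_at_distance.get(d, 0) + 1
    g, running = [], 0
    for h in range(max(pairs_at_distance) + 1):
        running += pairs_at_distance.get(h, 0)   # cumulative count up to h hops
        g.append(running)
    return g

# Path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
g = hop_plot(path)
print(g)
```

The last entry of `g` counts all reachable pairs, and the smallest `h` at which the curve saturates is the diameter.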

Moreover, on all plots, the error bars of two standard deviations show the
variance of the graph properties for different realizations $R(\hat{\Theta}^{[k]})$.
To obtain the error bars, we took the same $\hat{\Theta}$ and generated 50
realizations of a Kronecker graph. For most of the plots, the error bars are so
small as to be practically invisible; this shows that the variance of network
properties when generating a stochastic Kronecker graph is indeed very small.

Also notice that As-RouteViews is an undirected graph and that the
fitted parameter matrix $\hat{\Theta}$ is in fact symmetric. This means that without
a priori biasing the fitting toward undirected graphs, the recovered parameters obey
this aspect of the network. Fitting the As-RouteViews graph from a random set of
parameters, performing gradient descent for 100 iterations, and at each iteration
sampling half a million permutations took less than 10 minutes on a standard
desktop PC. This is a significant speedup over [Bezáková et al. 2006], where using a
similar permutation sampling approach for calculating the likelihood of a
preferential attachment model on a similar As-RouteViews graph took about two days
on a cluster of 50 machines.

Choice of the initiator matrix size $N_1$

As mentioned earlier, for finding the optimal number of parameters, i.e., selecting
the size of the initiator matrix, BIC naturally applies to the case of Kronecker
graphs. Figure 9.23(b) shows BIC scores for the following experiment. We generated
a Kronecker graph with N = 2187 and M = 8736 using $N_1 = 3$ (9 parameters) and
$k = 7$. For $1 \leq N_1 \leq 9$, we find the MLE parameters using gradient descent
and calculate the BIC scores. The model with the lowest score is chosen. As
Figure 9.23(b) shows, we recovered the true model, i.e., the BIC score is the lowest
for the model with the true number of parameters, $N_1 = 3$.

Intuitively we expect a more complex model with more parameters to fit the
data better. Thus, we expect larger $N_1$ to generally give better likelihood. On the
other hand, the fit will also depend on the size of the graph G. Kronecker graphs
can only generate graphs on $N_1^k$ nodes, while real graphs do not necessarily have
$N_1^k$ nodes (for some, preferably small, integers $N_1$ and $k$). To solve this
problem, we choose $k$ so that $N_1^{k-1} < N(G) \leq N_1^k$, and then augment G by
adding $N_1^k - N$ isolated nodes. Or equivalently, we pad the adjacency matrix of G
with zeros until it is of the appropriate size, $N_1^k \times N_1^k$. While this
solves the problem of requiring the integer power of the number of nodes, it also
makes the fitting problem harder; for example, when $N \ll N_1^k$, we are basically
fitting G plus a large number of isolated nodes.
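
The padding rule can be written down in a few lines. The sketch below uses a dense NumPy array for clarity, which is only practical for small graphs; a sparse matrix would be needed at the scale of the networks discussed here. Function names are illustrative.

```python
import numpy as np

def kronecker_order(n_nodes, n1):
    """Smallest k such that n1**k >= n_nodes."""
    k = 1
    while n1 ** k < n_nodes:
        k += 1
    return k

def pad_adjacency(adj, n1):
    """Zero-pad a square adjacency matrix to size n1**k, i.e., add isolated nodes."""
    n = adj.shape[0]
    k = kronecker_order(n, n1)
    size = n1 ** k
    padded = np.zeros((size, size), dtype=adj.dtype)
    padded[:n, :n] = adj
    return padded, k

# Padded sizes for a 6474-node graph and initiator sizes 2..6.
padded_sizes = {n1: n1 ** kronecker_order(6474, n1) for n1 in range(2, 7)}
print(padded_sizes)

# Tiny example: a 5-node cycle padded for a 2x2 initiator.
cycle = np.zeros((5, 5), dtype=int)
for i in range(5):
    cycle[i, (i + 1) % 5] = cycle[(i + 1) % 5, i] = 1
padded, k = pad_adjacency(cycle, 2)
```

For 6474 nodes the padded sizes are 8192, 6561, 16,384, 15,625, and 7776 for initiator sizes 2 through 6, so the amount of padding varies considerably with the initiator size.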

Table 9.2 shows the results of fitting Kronecker graphs to As-RouteViews
while varying the size of the initiator matrix $N_1$. First, notice that, in general,
larger $N_1$ results in higher log-likelihood $l(\hat{\Theta})$ at MLE. Similarly,
notice (column $N_1^k$) that while As-RouteViews has 6474 nodes, Kronecker estimates
have up to 16,384 nodes ($16{,}384 = 4^7$, which is the first integer power of 4
greater than 6474). However, we also show the number of nonzero-degree (nonisolated)
nodes in the Kronecker graph (column $|\{\deg(u) > 0\}|$). Notice that the number of
nonisolated nodes corresponds well to the number of nodes in the As-RouteViews
network. This shows that KronFit is actually fitting the graph well, and it
successfully fits the structure of the graph plus a number of isolated nodes. Last,
column $M_1^k$ gives the number of edges in the corresponding Kronecker graph, which
is close to the true number of edges of the As-RouteViews graph.

Table 9.2. Log-likelihood at MLE.

MLE for different choices of the size of the initiator matrix $N_1$ for the
As-RouteViews graph. Notice that the log-likelihood $l(\hat{\Theta})$ generally
increases with the model complexity $N_1$. Also notice the effect of
zero-padding, i.e., for $N_1 = 4$ and $N_1 = 5$, the constraint of the number of
nodes being an integer power of $N_1$ decreases the log-likelihood. However,
the column $|\{\deg(u) > 0\}|$ gives the number of nonisolated nodes in the
network, which is much less than $N_1^k$ and is, in fact, very close to the
true number of nodes in As-RouteViews. Using the BIC scores,
we see that $N_1 = 3$ or $N_1 = 6$ is the best choice for the size of the initiator
matrix.

$N_1$           $l(\hat{\Theta})$   $N_1^k$   $M_1^k$   $|\{\deg(u) > 0\}|$   BIC score
2               -152,499            8192      25,023    5675                  152,506
3               -127,066            6561      28,790    5683                  127,083
4               -153,260            16,384    24,925    8222                  153,290
5               -149,949            15,625    29,111    9822                  149,996
6               -128,241            7776      26,557    6623                  128,309
As-RouteViews                                 26,467    6474

Last, comparing the log-likelihood at the MLE and the BIC score in Table 9.2,
we notice that the log-likelihood heavily dominates the BIC score. This means that
the size of the initiator matrix (number of parameters) is so small that overfitting
is not a concern. Thus, we can just choose the initiator matrix that maximizes
the likelihood. A simple calculation shows that one would need to take initiator
matrices with thousands of entries before the model complexity part of the BIC
score would start to play a significant role.
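
The dominance of the log-likelihood can be read directly off Table 9.2. The sketch below assumes that the BIC score is the negative log-likelihood plus a complexity penalty (which is consistent with the tabulated values) and recovers that penalty by subtraction; the dictionary simply transcribes the table.

```python
# Initiator size -> (log-likelihood at MLE, BIC score), transcribed from Table 9.2.
table_9_2 = {
    2: (-152499, 152506),
    3: (-127066, 127083),
    4: (-153260, 153290),
    5: (-149949, 149996),
    6: (-128241, 128309),
}

penalties = {}
for n1, (loglik, bic) in table_9_2.items():
    penalty = bic + loglik            # BIC minus the negative log-likelihood
    penalties[n1] = penalty
    print(f"N1={n1}: penalty={penalty}, share of BIC={penalty / bic:.5f}")

largest_share = max(penalties[n1] / table_9_2[n1][1] for n1 in table_9_2)
```

Even for the 6 x 6 initiator with 36 parameters, the penalty accounts for well under a tenth of a percent of the score, which is why maximizing the likelihood alone gives the same ranking.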

We further examine the sensitivity of the choice of the initiator size by the
following experiment. We generate a stochastic Kronecker graph K on nine
parameters ($N_1 = 3$), and then fit a Kronecker graph K' with a smaller number of
parameters (four instead of nine, $N_1' = 2$), and also a Kronecker graph K'' of the
same complexity as K ($N_1'' = 3$).

Figure 9.18 plots the properties of all three graphs. Not surprisingly, K''
(blue) fits the properties of K (red) perfectly as the initiator is of the same size.
On the other hand, K' (green) is a simpler model with only four parameters (instead
of nine as in K and K'') and still generally fits well: hop plot and degree
distribution match well, while spectral properties of the graph adjacency matrix,
especially the scree plot, are not matched that well. This shows that nothing
drastic happens and that even a slightly too simple model still fits the data well.
In general, we observe empirically that by increasing the size of the initiator
matrix, one does not gain radically better fits for degree distribution and hop
plot. On the other hand, there is usually an improvement in the scree plot and the
plot of network values when one increases the initiator size.

Figure 9.18. $3 \times 3$ stochastic Kronecker graphs.

Given a stochastic Kronecker graph K generated from $N_1 = 3$ (circles),
we fit a Kronecker graph K' with $N_1' = 2$ (squares) and K'' with $N_1'' = 3$
(triangles). Not surprisingly, K'' fits the properties of K perfectly as
the model is of the same complexity. On the other hand, K' has only
four parameters (instead of nine as in K and K'') and still fits well.



Network parameters over time 

Next, we briefly examine the evolution of the Kronecker initiator for a temporally
evolving graph. The idea is that, given parameter estimates of a real graph $G_t$ at
time $t$, we can forecast the future structure of the graph $G_{t+x}$ at time $t + x$,
i.e., using parameters obtained from $G_t$, we can generate a larger synthetic graph K
that will be similar to $G_{t+x}$.

As we have the information about the evolution of the As-RouteViews network,
we estimated parameters for three snapshots of the network when it had
about $2^k$ nodes. Table 9.3 gives the results of the fitting for the three temporal
snapshots of the As-RouteViews network. Notice that the parameter estimates
$\hat{\Theta}$ remain remarkably stable over time. This stability means that Kronecker
graphs can be used to estimate the structure of the networks in the future, i.e.,
parameters estimated from the historic data can extrapolate the graph structure in
the future.

Table 9.3. Parameter estimates of temporal snapshots.

Parameter estimates of the three temporal snapshots of the
As-RouteViews network. Notice that estimates stay remarkably stable
over time.

Snapshot at time   N      M        $l(\hat{\Theta})$   Estimates at MLE, $\hat{\Theta}$
$T_1$              2048   8794     -40,535             [0.981, 0.633; 0.633, 0.048]
$T_2$              4088   15,711   -82,675             [0.934, 0.623; 0.622, 0.044]
$T_3$              6474   26,467   -152,499            [0.987, 0.571; 0.571, 0.049]

Figure 9.19 further explores this. It overlays the graph properties of the real
As-RouteViews network at time $T_3$ and the synthetic graphs for which we used
the parameters obtained on historic snapshots of As-RouteViews at times $T_1$ and
$T_2$. The agreements are good, demonstrating that Kronecker graphs can forecast
the structure of the network in the future.

Moreover, this experiment also shows that parameter estimates do not suffer
much from the zero-padding of a graph adjacency matrix (i.e., adding isolated nodes
to make G have $N_1^k$ nodes). Snapshots of As-RouteViews at $T_1$ and $T_2$ have
close to $2^k$ nodes, while we had to add 26% (1718) isolated nodes to the network at
$T_3$ to make the number of nodes be $2^k$. Regardless of this, we see that the
parameter estimates $\hat{\Theta}$ remain basically constant over time, which seems
to be independent of the number of isolated nodes added. This means that the
estimated parameters are not biased too much from zero-padding the adjacency matrix
of G.

9.6.5 Fitting to other large real-world networks 

Last, we present results of fitting stochastic Kronecker graphs to 20 large
real-world networks: large online social networks (Epinions, Flickr, and Delicious),
web and blog graphs (Web-Notredame, Blog-nat05-6m, Blog-nat06all), Internet
and peer-to-peer networks (As-Newman, Gnutella-25, Gnutella-30),
collaboration networks of coauthorships from DBLP (CA-DBLP) and various areas
of physics (CA-hep-th, CA-hep-ph, CA-gr-qc), physics citation networks
(Cit-hep-ph, Cit-hep-th), an email network (Email-Inside), a protein-interaction
network (Bio-Proteins), and a bipartite affiliation network (authors-to-papers,
AtP-gr-qc). Refer to Table 9.5 in the appendix for the description and basic
properties of these networks. They are available for download at
http://snap.stanford.edu.

For each data set, we started gradient descent from a random point (random
initiator matrix) and ran it for 100 steps. At each step, we estimated the likelihood
and the gradient on the basis of 510,000 sampled permutations, where we discard
the first 10,000 samples to allow the chain to burn in.

Figure 9.19. Autonomous Systems (AS) network over time (As-RouteViews).
Overlaid patterns of the real As-RouteViews network at time $T_3$
and the Kronecker graphs with parameters estimated from As-RouteViews at
times $T_1$ and $T_2$. Notice the good fits, which mean that parameters estimated on
historic snapshots can be used to estimate the graph in the future.



Table 9.4 gives the estimated parameters, the corresponding log-likelihoods,
and the wall clock times. All experiments were carried out on a standard desktop
computer. Notice that the estimated initiator matrices $\hat{\Theta}$ seem to have
an almost universal structure with a large value in the top left entry, a very small
value at the bottom right corner, and intermediate values in the other two corners.
We further discuss the implications of such a structure of a Kronecker initiator
matrix on the global network structure in the next section.

Last, Figures 9.20 and 9.21 show overlays of various network properties of real
and estimated synthetic networks. In addition to the network properties we plotted
in Figure 9.18, we also separately plot in- and out-degree distributions (as both
networks are directed) and plot the node triangle participation in panel (c), where
we plot the number of triangles a node participates in versus the number of such
nodes. (Again the error bars show the variance of network properties over different
realizations $R(\hat{\Theta}^{[k]})$ of a stochastic Kronecker graph.)

Table 9.4. Results of parameter estimation.

Parameters for 20 different networks. Table 9.5 gives the description
and basic properties of the network data sets. Networks are available
for download at http://snap.stanford.edu.

Network         N         M           Estimated MLE parameters $\hat{\Theta}$   $l(\hat{\Theta})$   Time
As-RouteViews   6474      26,467      [0.987, 0.571; 0.571, 0.049]              -152,499            8m15s
AtP-gr-qc       19,177    26,169      [0.902, 0.253; 0.221, 0.582]              -242,493            7m40s
Bio-Proteins    4626      29,602      [0.847, 0.641; 0.641, 0.072]              -185,130            43m41s
Email-Inside    986       32,128      [0.999, 0.772; 0.772, 0.257]              -107,283            1h07m
CA-gr-qc        5242      28,980      [0.999, 0.245; 0.245, 0.691]              -160,902            14m02s
As-Newman       22,963    96,872      [0.954, 0.594; 0.594, 0.019]              -593,747            28m48s
Blog-nat05-6m   31,600    271,377     [0.999, 0.569; 0.502, 0.221]              -1,994,943          47m20s
Blog-nat06all   32,443    318,815     [0.999, 0.578; 0.517, 0.221]              -2,289,009          52m31s
CA-hep-ph       12,008    237,010     [0.999, 0.437; 0.437, 0.484]              -1,272,629          1h22m
CA-hep-th       9877      51,971      [0.999, 0.271; 0.271, 0.587]              -343,614            21m17s
Cit-hep-ph      30,567    348,721     [0.994, 0.439; 0.355, 0.526]              -2,607,159          51m26s
Cit-hep-th      27,770    352,807     [0.990, 0.440; 0.347, 0.538]              -2,507,167          15m23s
Epinions        75,879    508,837     [0.999, 0.532; 0.480, 0.129]              -3,817,121          45m39s
Gnutella-25     22,687    54,705      [0.746, 0.496; 0.654, 0.183]              -530,199            16m22s
Gnutella-30     36,682    88,328      [0.753, 0.489; 0.632, 0.178]              -919,235            14m20s
Delicious       205,282   436,735     [0.999, 0.327; 0.348, 0.391]              -4,579,001          27m51s
Answers         598,314   1,834,200   [0.994, 0.384; 0.414, 0.249]              -20,508,982         2h35m
CA-DBLP         425,957   2,696,489   [0.999, 0.307; 0.307, 0.574]              -26,813,878         3h01m
Flickr          584,207   3,555,115   [0.999, 0.474; 0.485, 0.144]              -32,043,787         4h26m
Web-Notredame   325,729   1,497,134   [0.999, 0.414; 0.453, 0.229]              -14,588,217         2h59m

Notice that, for both networks and in all cases, the properties of the real
network and the synthetic Kronecker graph coincide very well. Using stochastic
Kronecker graphs with just four parameters, we match the scree plot, degree
distributions, triangle participation, hop plot, and network values.

Given the previous experiments from the autonomous systems graph, we only
present the results for the simplest model with initiator size $N_1 = 2$. Empirically,
we also observe that $N_1 = 2$ gives surprisingly good fits and the estimation
procedure is the most robust and converges the fastest. Using larger initiator
matrices $N_1 > 2$ generally helps improve the likelihood but not dramatically. In
terms of matching the network properties, we also get a slight improvement by making
the model more complex. Figure 9.22 gives the percent improvement in log-likelihood
as we make the model more complex. We use the log-likelihood of a $2 \times 2$ model
as a baseline and estimate the log-likelihood at the MLE for larger initiator
matrices. Again, models with more parameters tend to fit better. However, sometimes
due to zero-padding of a graph adjacency matrix, they actually have lower
log-likelihood (as seen in Table 9.2).

Figure 9.20. Blog network (Blog-nat06all).
Overlaid patterns of the real network and the estimated Kronecker graph
using four parameters ($2 \times 2$ initiator matrix). Notice that the Kronecker
graph matches all properties of the real network.



9.6.6 Scalability 

Last, we also empirically evaluate the scalability of KronFit. The experiment
confirms that KronFit runtime scales linearly with the number of edges M in a
graph G. More precisely, we performed the following experiment.

We generated a sequence of increasingly larger synthetic graphs on N nodes
and 8N edges, and measured the time of one iteration of gradient descent, i.e.,
the time to sample one million permutations and evaluate the gradients. We started
with a graph on 1000 nodes and finished with a graph on 8 million nodes and
64 million edges. Figure 9.23(a) shows that KronFit scales linearly with the size of
the network. We plot wall-clock time versus size of the graph. The dashed line gives
a linear fit to the data points.

Figure 9.21. Who-trusts-whom social network (Epinions).

Overlaid patterns of the real network and the fitted Kronecker graph
using only four parameters ($2 \times 2$ initiator matrix). Again, the synthetic
Kronecker graph matches all the properties of the real network.

Figure 9.22. Improvement in log-likelihood.

Percent improvement in log-likelihood over the $2 \times 2$ model as we
increase the model complexity (size of initiator matrix). In general, larger
initiator matrices that have more degrees of freedom help improve the
fit of the model.

Figure 9.23. Performance.

(a) Processor time to sample one million gradients as the graph grows.
Notice the algorithm scales linearly with the graph size. (b) BIC score
for model selection.

9.7 Discussion

Here we discuss several of the desirable properties of the proposed Kronecker graphs. 

Generality: Stochastic Kronecker graphs include several other generators as
special cases. For $\theta_{ij} = c$, we obtain the classical Erdős-Rényi random graph
model. For $\theta_{ij} \in \{0, 1\}$, we obtain a deterministic Kronecker graph.
Setting the $K_1$ matrix to a $2 \times 2$ matrix, we obtain the R-MAT generator
[Chakrabarti et al. 2004]. In contrast to Kronecker graphs, R-MAT cannot extrapolate
into the future since it needs to know the number of edges to insert. Thus, it is
incapable of obeying the densification power law.
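
The Erdős-Rényi special case is easy to verify numerically: if every initiator entry equals $c$, every entry of the $k$th Kronecker power equals $c^k$, so all edges are equally likely. A minimal check with illustrative values of $c$ and $k$:

```python
import numpy as np
from functools import reduce

c, k = 0.5, 4                               # illustrative values
theta = np.full((2, 2), c)                  # constant initiator, theta_ij = c
P = reduce(np.kron, [theta] * k)            # k-th Kronecker power: 2^k x 2^k probabilities

# Every pair of nodes is connected with the same probability c^k,
# which is the Erdos-Renyi model G(n, p) with n = 2^k and p = c^k.
uniform = np.allclose(P, c ** k)
print(P.shape, uniform)
```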

Phase transition phenomena: The Erdős-Rényi graphs exhibit phase
transitions [Erdős & Rényi 1960]. Several researchers argue that real systems are
"at the edge of chaos" or phase transition [Bak 1996, Sole & Goodwin 2000].
Stochastic Kronecker graphs also exhibit phase transitions [Mahdian & Xu 2007] for
the emergence of the giant component and another phase transition for connectivity.

Implications for the structure of large real networks: Empirically,
we found that a $2 \times 2$ initiator ($N_1 = 2$) fits the properties of real-world
networks well. Moreover, given a $2 \times 2$ initiator matrix, one can look at it as
a recursive expansion of two groups into subgroups. We introduced this recursive
view of Kronecker graphs back in Section 9.3. So, one can then interpret the diagonal
values of $\Theta$ as the proportion of edges inside each of the groups, and the
off-diagonal values give the fraction of edges connecting the groups. Figure 9.24
illustrates the setting for two groups.

For example, as shown in Figure 9.24, large $a, d$ and small $b, c$ would imply
that the network is composed of hierarchically nested communities, where there are
many edges inside each community and few edges crossing them [Leskovec 2009].
One could think of this structure as some kind of organizational or university
hierarchy, where one expects the most friendships between people within the same
lab, a bit less between people in the same department, less across different
departments, and the least friendships to be formed across people from different
schools of the university.

However, parameter estimates for a wide range of networks presented in
Table 9.4 suggest a very different picture of the network structure. Notice that for
most networks $a \gg b > c \gg d$. Moreover, $a \approx 1$, $b \approx c \approx 0.6$,
and $d \approx 0.2$. We empirically observed that the same structure of initiator
matrix $\hat{\Theta}$ also holds when fitting $3 \times 3$ or $4 \times 4$ models.
Always the top left element is the largest, and then the values on the diagonal
decay faster than the values off the diagonal [Leskovec 2009]. This suggests a
network structure that is also known as core-periphery [Borgatti & Everett 2000,
Holme 2005], the jellyfish [Tauro et al. 2001, Siganos et al. 2006], or the octopus
[Chung & Lu 2006] structure of the network, as illustrated in Figure 9.24(c).

All of the above basically say that the network is composed of a densely linked
network core and the periphery. In our case, this would imply the following structure
of the initiator matrix. The core is modeled by parameter $a$ and the periphery by
$d$. Most edges are inside the core (large $a$), and very few between the nodes of
the periphery (small $d$). Then there are many more edges between the core and the
periphery than inside the periphery ($b, c > d$) [Leskovec 2009]. This is exactly
what we see as well. In the spirit of Kronecker graphs, the structure repeats
recursively: the core again has the dense core and the periphery, and so on.
Similarly, the periphery itself has the core and the periphery.

(b) Two recursive communities

(c) Core-periphery

Figure 9.24. Kronecker communities.

$2 \times 2$ Kronecker initiator matrix (a) can be thought of as two communities
where there are $a$ and $d$ edges inside each of the communities and $b$
and $c$ edges crossing the two communities as illustrated in (b). Each
community can then be recursively divided using the same pattern. (c)
The onionlike core-periphery structure where the network gets denser
and denser as we move towards the center of the network.
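
The core-periphery reading of the initiator can be illustrated numerically. The sketch below takes the fitted As-RouteViews initiator from Table 9.4, forms a small Kronecker power ($k = 6$ is an arbitrary choice made only to keep the matrix small), and splits the nodes into a first half ("core") and a second half ("periphery"); the split and the variable names are illustrative assumptions.

```python
import numpy as np
from functools import reduce

theta = np.array([[0.987, 0.571],
                  [0.571, 0.049]])       # As-RouteViews initiator from Table 9.4
k = 6                                    # small power, chosen only for illustration
P = reduce(np.kron, [theta] * k)         # edge probabilities on 2^k nodes

# Expected number of edges in each of the four top-level blocks.
half = P.shape[0] // 2
block_mass = np.array([[P[:half, :half].sum(), P[:half, half:].sum()],
                       [P[half:, :half].sum(), P[half:, half:].sum()]])
fractions = block_mass / P.sum()
print(np.round(fractions, 3))

# The block fractions reproduce the normalized initiator, independent of k.
matches_initiator = np.allclose(fractions, theta / theta.sum())
```

The core-core block carries the largest share of expected edges and the periphery-periphery block the smallest, with the two crossing blocks in between. Because each block is itself a scaled Kronecker power, the same proportions recur inside the core and inside the periphery.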

This structure suggests an onionlike nested core-periphery [Leskovec et al. 2008b,
Leskovec et al. 2008a] network structure as illustrated in Figure 9.24(c), where the
network is composed of denser and denser layers as one moves towards the center
of the network. We also observe a similar structure of the Kronecker initiator when
fitting a $3 \times 3$ or $4 \times 4$ initiator matrix. The diagonal elements have
large but decreasing values, with off-diagonal elements following the same
decreasing pattern.

These Kronecker initiators imply that networks do not break nicely into
hierarchically organized sets of communities that lend themselves to graph
partitioning and community detection algorithms. On the contrary, it appears that
large networks can be decomposed into a densely linked core with many small
periphery pieces hanging off the core. Our recent results [Leskovec et al. 2008b,
Leskovec et al. 2008a] make a similar observation (using a completely different
methodology based on graph partitioning) about the clustering and community
structure of large real-world networks.

9.8 Conclusion 

In conclusion, the main contribution of this work is a family of models of network 
structure that uses a nontraditional matrix operation, the Kronecker product. The 
resulting graphs have (a) all the static properties (heavy-tailed degree distribution, 
small diameter, etc.) and (b) all the temporal properties (densification, shrinking 
diameter) that are found in real networks. In addition, we can formally prove all of 
these properties. 

Several of the proofs are extremely simple, thanks to the rich theory of Kro- 
necker multiplication. We also provide proofs about the diameter and effective 
diameter, and we show that stochastic Kronecker graphs can mimic real graphs 
well. 

Moreover, we also presented KronFit, a fast, scalable algorithm to estimate 
the stochastic Kronecker initiator, which can then be used to create a synthetic 
graph that mimics the properties of a given real network. 

In contrast to earlier work, our work has the following novelties: (a) it is among 
the few that estimates the parameters of the chosen generator in a principled way, 
(b) it is among the few that has a concrete measure of goodness of the fit (namely, 
the likelihood), (c) it avoids the quadratic complexity of computing the likelihood 
by exploiting the properties of the Kronecker graphs, and (d) it avoids the factorial 
explosion of the node correspondence problem by using Metropolis sampling.

The resulting algorithm matches well all the known properties of real graphs.
As we show with the Epinions graph and the AS graph, it scales linearly in the
number of edges, and it is orders of magnitude faster than earlier graph-fitting
attempts: 20 minutes on a commodity PC versus two days on a cluster of 50
workstations [Bezakova et al. 2006].

The benefits of fitting a Kronecker graph model into a real graph are several: 

• Extrapolation: Once we have the Kronecker generator $\Theta$ for a given real matrix
G (such that G is mimicked by $\Theta^{\otimes k}$), a larger version of G can be generated
by $\Theta^{\otimes (k+1)}$.

• Null-model: When analyzing a real network G, one often needs to assess the
significance of the observation. $\Theta^{\otimes k}$ that mimics G can be used as an accurate
model of G.

• Network structure: Estimated parameters give insight into the global network
and community structure of the network.

• Forecasting: As we demonstrated, one can obtain $\Theta$ from a graph $G_t$ at time t
such that $G_t$ is mimicked by $\Theta^{\otimes k}$. Then $\Theta$ can be used to model the structure
of $G_{t+x}$ in the future.

• Sampling: Similarly, if we want a realistic sample of the real graph, we could
use a smaller exponent in the Kronecker exponentiation, like $\Theta^{\otimes (k-1)}$.

• Anonymization: Since $\Theta^{\otimes k}$ mimics G, we can publish $\Theta^{\otimes k}$ without revealing
information about the nodes of the real graph G.
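
The extrapolation and sampling items above come down to changing the Kronecker exponent. The following Python sketch illustrates the bookkeeping only: the k-th Kronecker power of an N1 × N1 initiator has N1^k nodes, and its entries sum to the k-th power of the initiator's entry sum, which is the expected number of edges. The initiator values and the function names are hypothetical and are not fitted to any data set in this chapter.

```python
def kronecker_size(theta, k):
    """Node count and expected edge count of the k-th Kronecker power of theta.

    theta is an N1 x N1 initiator of edge probabilities (list of rows).
    """
    n1 = len(theta)
    total = sum(sum(row) for row in theta)
    return n1 ** k, total ** k


def exponent_for(theta, target_nodes):
    """Smallest exponent k whose Kronecker power has at least target_nodes nodes."""
    n1 = len(theta)
    k = 1
    while n1 ** k < target_nodes:
        k += 1
    return k


# Hypothetical 2 x 2 initiator with a core-periphery shape (a > b, c > d).
theta = [[0.9, 0.5],
         [0.5, 0.2]]

k = exponent_for(theta, 1000)                        # exponent covering 1000 nodes
nodes, edges = kronecker_size(theta, k)
bigger_nodes, bigger_edges = kronecker_size(theta, k + 1)     # extrapolation
smaller_nodes, smaller_edges = kronecker_size(theta, k - 1)   # sampling
```

Raising or lowering the exponent by one multiplies or divides the node count by N1 and the expected edge count by the entry sum of the initiator.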

Future work could include extensions of Kronecker graphs to evolving networks.
We envision formulating a dynamic Bayesian network with first order
Markov dependencies, where the parameter matrix at time t depends on the graph
$G_t$ at the current time t and the parameter matrix at time t − 1. Given a series of
network snapshots, one would then aim to estimate initiator matrices at individual
time steps and the parameters of the model governing the evolution of the initiator
matrix. We expect that, on the basis of the evolution of the initiator matrix, one
would gain greater insight into the evolution of large networks.

A second direction for future work is to explore connections between Kronecker
graphs and Random Dot Product graphs [Young & Scheinerman 2007, Nickel 2008].
This also nicely connects with the "attribute view" of Kronecker graphs as described
in Section 9.3.5. It would be interesting to design methods to estimate the individual
node attribute values as well as the attribute-attribute similarity matrix (i.e., the
initiator matrix). If some of the network's node attributes are already given, one could
then try to infer "hidden" or missing node attribute values and in this way gain insight
into individual nodes as well as individual edge formations. Moreover, this would be
interesting as one could further evaluate how realistic the "attribute view" of
Kronecker graphs is.

Last, we also mention possible extensions of Kronecker graphs for modeling
weighted and labeled networks. Currently stochastic Kronecker graphs use a
Bernoulli edge generation model, i.e., an entry of the big matrix $\mathcal{P}$ encodes the
parameter of a Bernoulli coin. In a similar spirit, one could consider entries of $\mathcal{P}$ to
encode parameters of different edge generative processes. For example, to generate
networks with weights on edges, an entry of $\mathcal{P}$ could encode the parameter of
an exponential distribution, or in the case of labeled networks, one could use several
initiator matrices in parallel and in this way encode parameters of a multinomial
distribution over different node attribute values.

Appendix: Table of networks 

Table 9.5 lists all the network data sets that were used in this chapter. We also 
computed some of the structural network properties. Most of the networks are 
available for download at http://snap.stanford.edu. 
Table 9.5. Network data sets analyzed.

Statistics of networks considered: number of nodes N; number of edges E; number of nodes in largest connected
component $N_c$; fraction of nodes in largest connected component $N_c/N$; average clustering coefficient $\bar{C}$; diameter
D; and average path length $\bar{D}$. Networks are available for download at http://snap.stanford.edu.

| Network | N | E | $N_c$ | $N_c/N$ | $\bar{C}$ | D | $\bar{D}$ | Description |
|---|---|---|---|---|---|---|---|---|
| Social networks | | | | | | | | |
| Answers | 598,314 | 1,834,200 | 488,484 | 0.82 | 0.11 | 22 | 5.72 | Yahoo! Answers social network [Leskovec et al. 2008b] |
| Delicious | 205,282 | 436,735 | 147,567 | 0.72 | 0.3 | 24 | 6.28 | del.icio.us social network [Leskovec et al. 2008b] |
| Email-Inside | 986 | 32,128 | 986 | 1.00 | 0.45 | 7 | 2.6 | European research organization email network [Leskovec et al. 2007a] |
| Epinions | 75,879 | 508,837 | 75,877 | 1.00 | 0.26 | 15 | 4.27 | Who-trusts-whom graph of epinions.com [Richardson et al. 2003] |
| Flickr | 584,207 | 3,555,115 | 404,733 | 0.69 | 0.4 | 18 | 5.42 | Flickr photo-sharing social network [Kumar et al. 2006] |
| Information (citation) networks | | | | | | | | |
| Blog-nat05-6m | 31,600 | 271,377 | 29,150 | 0.92 | 0.24 | 10 | 3.4 | Blog-to-blog citation network (6 months of data) [Leskovec et al. 2007b] |
| Blog-nat06all | 32,443 | 318,815 | 32,384 | 1.00 | 0.2 | 18 | 3.94 | Blog-to-blog citation network (1 year of data) [Leskovec et al. 2007b] |
| Cit-hep-ph | 30,567 | 348,721 | 34,401 | 1.13 | 0.3 | 14 | 4.33 | Citation network of ArXiv hep-th papers [Gehrke et al. 2003] |
| Cit-hep-th | 27,770 | 352,807 | 27,400 | 0.99 | 0.33 | 15 | 4.2 | Citations network of ArXiv hep-ph papers [Gehrke et al. 2003] |
| Collaboration networks | | | | | | | | |
| CA-DBLP | 425,957 | 2,696,489 | 317,080 | 0.74 | 0.73 | 23 | 6.75 | DBLP coauthorship network [Backstrom et al. 2006] |
| CA-gr-qc | 5242 | 28,980 | 4158 | 0.79 | 0.66 | 17 | 6.1 | Coauthorship network in gr-qc ArXiv [Leskovec et al. 2005b] |
| CA-hep-ph | 12,008 | 237,010 | 11,204 | 0.93 | 0.69 | 13 | 4.71 | Coauthorship network in hep-ph ArXiv [Leskovec et al. 2005b] |
| CA-hep-th | 9877 | 51,971 | 8638 | 0.87 | 0.58 | 18 | 5.96 | Coauthorship network in hep-th ArXiv [Leskovec et al. 2005b] |
| Web graphs | | | | | | | | |
| Web-Notredame | 325,729 | 1,497,134 | 325,729 | 1.00 | 0.47 | 46 | 7.22 | Web graph of University of Notre Dame [Albert et al. 1999] |
| Internet networks | | | | | | | | |
| As-Newman | 22,963 | 96,872 | 22,963 | 1.00 | 0.35 | 11 | 3.83 | AS graph from Newman [newman07netdata] |
| As-RouteViews | 6474 | 26,467 | 6474 | 1.00 | 0.4 | 9 | 3.72 | AS from Oregon Route View [Leskovec et al. 2005b] |
| Gnutella-25 | 22,687 | 54,705 | 22,663 | 1.00 | 0.01 | 11 | 5.57 | Gnutella P2P network on 3/25 2000 [Ripeanu et al. 2002] |
| Gnutella-30 | 36,682 | 88,328 | 36,646 | 1.00 | 0.01 | 11 | 5.75 | Gnutella P2P network on 3/30 2000 [Ripeanu et al. 2002] |
| Bipartite networks | | | | | | | | |
| AtP-gr-qc | 19,177 | 26,169 | 14,832 | 0.77 | 0 | 35 | 11.08 | Affiliation network of gr-qc category in ArXiv [Leskovec et al. 2007b] |
| Biological networks | | | | | | | | |
| Bio-Proteins | 4626 | 29,602 | 4626 | 1.00 | 0.12 | 12 | 4.24 | Yeast protein interaction network [Colizza et al. 2005] |
Chapter 10

The Kronecker Theory of 
Power Law Graphs 



Jeremy Kepner* 



Abstract 

An analytical theory of power law graphs is presented based on the Kro- 
necker graph generation technique. Explicit, stochastic, and instance 
Kronecker graphs are used to highlight different properties. The anal- 
ysis uses Kronecker exponentials of complete bipartite graphs to for- 
mulate the substructure of such graphs. The Kronecker theory allows 
various high-level quantities (e.g., degree distribution, betweenness cen- 
trality, diameter, eigenvalues, and iso-parametric ratio) to be computed 
directly from the model parameters. 

10.1 Introduction 

Power law graphs are ubiquitous and arise in the Internet [Faloutsos 1999], the web
[Broder 2000], citation graphs [Redner 1998], and online social networks (see
[Chakrabarti 2004]). Power law graphs have the general property that the
histograms of their degree distribution Deg() fall off with a power law and are
approximately linear in a log-log representation. Mathematically this observation can be
stated as

$$\mathrm{Slope}[\log(\mathrm{Count}[\mathrm{Deg}(G)])] \approx -\mathrm{constant}$$

*MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420 (kepner@ll.mit.edu).

The analytical theory describing the specific structure of these graphs is just
beginning to be developed. In this work, an analytical framework is proposed and 
used to illustrate the precise substructures that may exist within power law graphs. 
In addition, this analytical framework will also allow many higher level statistical 
quantities to be computed directly. 

There are a variety of techniques for generating power law graphs. The simulations
produced by these techniques are used to test methods for analyzing
real-world graphs. This work focuses on the innovative Kronecker product technique
[Leskovec 2005, Chakrabarti 2004], which generates power law graphs
with enough tunable parameters to provide detailed fits to real-world graphs. The
Kronecker product approach reproduces a large number of statistical properties of
real graphs: degree distribution, hop plot, singular value distribution, diameter,
densification, etc. In addition, and perhaps more importantly, it is based on an
adjacency matrix representation of graphs, which provides a rich set of theoretical
tools for more detailed analysis [Gilbert 2006]. To review, for a graph G = (V, E)
with N vertices and M edges, the N × N adjacency matrix A has the property
A(i, j) = 1 if there is an edge from vertex $v_i$ to vertex $v_j$ and is zero otherwise.

The outline of this work is as follows. First, an overview of some results is 
presented. Second, the Kronecker graph generation algorithm is reviewed. Next 
are some basic results on simplified Kronecker graphs based on fully connected 
bipartite graphs B(n, m) of sets with n and m vertices. The fifth section presents 
the fundamental analytic constructs necessary for a more sophisticated analysis of 
Kronecker graphs. Section 10.6 gives results for a more complex model of Kronecker 
graphs. Finally Section 10.7 discusses the implications of this work. 

10.2 Overview of results

The principal results presented in this work are 

1. The basic properties of Kronecker products of graphs, which include

(a) The Kronecker product of two bipartite graphs

$$P_B(B(n_1, m_1) \otimes B(n_2, m_2)) = B(n_1 n_2, m_1 m_2) \cup B(n_2 m_1, n_1 m_2)$$

(b) The permutation functions for manipulating Kronecker graphs. These
include the bipartite $P_B$, recursive bipartite $P_{\bar{k}}$, and pop $P_{\mathrm{pop}}$ permutations.

2. The construction of a useful analytic model of Kronecker graphs of the form
$G(n, m)^{\otimes k}$ where $G(n, m) = \beta B(n, m) + \alpha I + \gamma 1$, and a useful approximation
to this model is given by

$$G(n, m)^{\otimes k} \approx \beta^k B^{\otimes k} + \text{"}k\text{"} (\alpha I + \gamma 1) \otimes \beta^{k-1} B^{\otimes k-1}$$
3. The graph substructures generated by $B^{\otimes k}$ are the union of many smaller
bipartite graphs

$$P_{\bar{k}}(B(n, m)^{\otimes k}) = \bigcup_{i=1}^{2^{k-1}} B(n_k(i), m_k(i))$$

or equivalently

$$B(n, m)^{\otimes k} \stackrel{P}{=} \bigcup_{r=0}^{k-1} \bigcup^{\binom{k-1}{r}} B(n^{k-r} m^r, n^r m^{k-r})$$

4. The graph substructures generated by $(B + I)^{\otimes k}$ are

r— 1 

The first and second order terms are related by 

where xf is the connection between blocks of vertices and Xiii2 the strength 

of the connection. 

5. The above graph substructures can be used to compute higher level statistics.
For example, given n > m and r = 0, . . . , k

(a) The degree distribution for $\mathrm{Deg}(B(n, m)^{\otimes k})$ is

$$\mathrm{Count}[\mathrm{Deg} = n^r m^{k-r}] = \binom{k}{r} n^{k-r} m^r$$

(b) The betweenness centrality distribution for $C_b(B(n, m)^{\otimes k})$ is

$$\mathrm{Count}[C_b = (n/m)^{2r-k}(n^r m^{k-r} - 1)] = \binom{k}{r} n^{k-r} m^r$$

(c) The degree distribution of $(B + I)^{\otimes k}$ is given by

$$\mathrm{Count}[\mathrm{Deg} = (n + 1)^r (m + 1)^{k-r}] = \binom{k}{r} n^{k-r} m^r$$

6. Various additional results 

(a) A fast, space efficient Kronecker graph generation algorithm 

(b) The degree distribution of an arbitrary Kronecker matrix based on Pois- 
son statistics. 
10.3 Kronecker graph generation algorithm

The Kronecker product graph generation algorithm [Leskovec 2005, Chakrabarti 2004]
is quite elegant and can be described as follows. First, let $A \in \mathbb{R}^{M_B M_C \times N_B N_C}$,
$B \in \mathbb{R}^{M_B \times N_B}$, and $C \in \mathbb{R}^{M_C \times N_C}$; then the Kronecker product is defined as follows
[Van Loan 2000]:

$$A = B \otimes C = \begin{pmatrix}
B(1,1)C & B(1,2)C & \cdots & B(1,N_B)C \\
B(2,1)C & B(2,2)C & \cdots & B(2,N_B)C \\
\vdots & \vdots & & \vdots \\
B(M_B,1)C & B(M_B,2)C & \cdots & B(M_B,N_B)C
\end{pmatrix}$$

Now let $G \in \mathbb{R}^{N \times N}$ be an adjacency matrix. The Kronecker exponent to the power
k is as follows

$$G^{\otimes k} = G^{\otimes k-1} \otimes G$$

which generates an $N^k \times N^k$ adjacency matrix. This simple model naturally produces
self-similar graphs, yet even a small G matrix provides enough parameters to
fine-tune the generator to a particular real-world data set.
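
The definitions above can be sketched in a few lines of plain Python. This is only an illustration on small dense matrices stored as lists of rows; the helper names `kron` and `kron_power` are ours, and indices are 0-based in the code whereas the text uses 1-based indices.

```python
def kron(b, c):
    """Kronecker product of two matrices given as lists of rows.

    Block (i, j) of the result is b[i][j] * c.
    """
    return [[b_ij * c_kl for b_ij in b_row for c_kl in c_row]
            for b_row in b for c_row in c]


def kron_power(g, k):
    """k-th Kronecker power of g (k >= 1), built as G^(k-1) (x) G."""
    result = g
    for _ in range(k - 1):
        result = kron(result, g)
    return result


# Star graph with two spokes (N = 3); its second power is a 9 x 9 matrix.
star = [[0, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
g2 = kron_power(star, 2)
```

Because the full $N^k \times N^k$ matrix is formed explicitly, this approach is practical only for small N and k.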

At this point, it is worth noting that the model $G^{\otimes k}$ will be used in multiple
contexts. If G is an explicit adjacency matrix consisting of only 1's and 0's,
then likewise $G^{\otimes k}$ will also be such an "explicit" adjacency matrix. Most of the
subsequent theoretical analysis will be on these matrices, which reveal the graph
substructures and how the higher level statistical quantities vary with the parameters
of the model. Of course, such a precisely structured adjacency matrix does not
correspond to any real-world, organically generated graph.

To produce more realistic statistical models, let 0 < G(i, j) < 1 be a matrix of
probabilities. The "stochastic" adjacency matrix generated by $G^{\otimes k}$ then contains
at each entry the probability that an edge exists from vertex i to j. A limited
amount of theoretical analysis will be done on these matrices.

To create a specific "instance" of this matrix, edges are randomly selected 
from the stochastic matrix. One of the powerful features of the Kronecker product 
generation technique is that an arbitrarily large instance of the graph can be created 
efficiently without ever having to form the full stochastic adjacency matrix. 



10.3.1 Explicit adjacency matrix

An example worth closer consideration is when G represents a star graph with n
spokes (N = n + 1 total vertices). The Kronecker product of such a graph naturally
leads to graphs with many of the properties of real-world graphs. Denote the N × N
adjacency matrix for a star graph as

$$S(n + 1) = S(N) = \begin{pmatrix}
0 & 1 & \cdots & 1 \\
1 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 0 & \cdots & 0
\end{pmatrix}$$

Likewise, let I(N) denote the N × N identity matrix

$$I(N) = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}$$

Then set

$$G(N) = S(N) + I(N)$$

Figure 10.1 shows the explicit adjacency matrix formed by $G^{\otimes 3}$ for N = 4.

10.3.2 Stochastic adjacency matrix 

A stochastic adjacency matrix sets the probability of edges between any two vertices.
In the case of N = 4 the total number of free parameters is 16. To simplify this
situation, we separate the entries into two values consisting of a foreground ($\beta$, $\alpha$)
and a background ($\gamma$) contribution. This results in a model of the form

$$G(N) = \beta S(N) + \alpha I(N) + \gamma 1(N)$$

where $1 > \beta, \alpha \gg \gamma > 0$, and 1(N) is the N × N matrix of all ones. Figure 10.1 shows
the stochastic adjacency matrix formed by $G^{\otimes 3}$ for N = 4.

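The stochastic representation does lend itself to some analytical treatment.
For instance, define the following probability density functions for the rows and
columns of G

$$g^{\mathrm{col}}(1, j) = \sum_{i=1}^{N} G(i, j) \Big/ \Sigma_G$$

$$g^{\mathrm{row}}(i, 1) = \sum_{j=1}^{N} G(i, j) \Big/ \Sigma_G$$

where $\Sigma_G$ denotes the sum of all the entries of G.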

The probability of there being an in/out edge for a particular vertex is given by $p_i$
where

$$p = g^{\otimes k}$$

where the "row" and "col" superscripts have been dropped. The expected number
of in/out edges for a particular vertex is given by a Poisson distribution with an
expected value of $\lambda_i$

$$\mathrm{Prob}_i(n) = \lambda_i^n e^{-\lambda_i} / n!$$

where $\lambda = N_e p$ and $N_e$ is the number of total edges in the graph. The degree
distribution Deg of the adjacency matrix is obtained by summing these over all the
vertices

$$\mathrm{Count}[\mathrm{Deg} = n] = \sum_{i=1}^{N^k} \lambda_i^n e^{-\lambda_i} / n!$$
Figure 10.1. Kronecker adjacency matrices. Panels: Explicit; Stochastic; Instance.

Explicit, stochastic, and instance matrices for a star plus identity matrix:
G = S(4) + I(4), where $N_e = 8N_v$.



The above sum requires a high amount of numerical precision to compute, so the
following more practical formula should be used

$$\mathrm{Count}[\mathrm{Deg} = n] = \sum_{i=1}^{N^k} \exp\left[ n(\ln(p_i) + \ln(N_e)) - \lambda_i - \sum_{n'=1}^{n} \ln(n') \right]$$
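
As an illustration, the log-space form of the sum can be evaluated as in the following Python sketch. It rests on one assumption that should be stated plainly: the per-vertex probabilities p are taken to be the k-fold Kronecker power of the normalized row sums of G, which follows from the row sums of a Kronecker product being products of row sums. The function names and the 0/1 test matrix (the S(4) + I(4) pattern) are ours.

```python
import math


def row_probabilities(g, k):
    """Per-vertex edge probabilities for the k-th Kronecker power of g.

    Assumption: p is the k-fold Kronecker power of the normalized row sums.
    """
    total = sum(sum(row) for row in g)
    base = [sum(row) / total for row in g]
    p = [1.0]
    for _ in range(k):
        p = [p_i * b_j for p_i in p for b_j in base]
    return p


def degree_count(p, n_edges, n):
    """Expected number of vertices with degree n, evaluated in log space."""
    log_factorial = math.lgamma(n + 1)   # equals sum_{n'=1}^{n} ln(n')
    total = 0.0
    for p_i in p:
        lam = n_edges * p_i
        total += math.exp(n * (math.log(p_i) + math.log(n_edges)) - lam - log_factorial)
    return total


g = [[1, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 0, 0, 1]]
p = row_probabilities(g, 3)          # 64 vertices
n_edges = 8 * len(p)
histogram = [degree_count(p, n_edges, n) for n in range(400)]
```

Two sanity checks follow from the Poisson model: the histogram sums to the number of vertices, and its first moment equals the number of edges.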



This example illustrates some of the analysis we can perform on the stochastic
adjacency matrix. However, while it is possible to compute some higher level
statistical quantities, it is more difficult to get at the detailed substructure of the
graph.
10.3.3 Instance adjacency matrix

One of the advantages of the Kronecker method is that a specific instance of the
graph can be generated efficiently without ever having to form the entire adjacency
matrix. To create a single edge $e_{ij}$ from the stochastic adjacency matrix created by
$G^{\otimes k}$ requires generating a vertex pair i and j. Let $i_k$ and $j_k$ be k-bit
representations of i and j where each bit can take N values. Furthermore, let $i_k(k')$ and
$j_k(k')$ be the k'th bit of the representation. In short

$$i_k(k'), j_k(k') \in \{0, \ldots, N - 1\} \quad \forall k' \in \{1, \ldots, k\}$$

Each bit is randomly set using the following formulas

$$i_k(k') = \arg_{i'}\left( G^{\mathrm{row}}(i', 1) < r_i < G^{\mathrm{row}}(i' + 1, 1) \right)$$

$$j_k(k') = \arg_{j'}\left( G^{\mathrm{col}}(i_k(k'), j') < r_j < G^{\mathrm{col}}(i_k(k'), j' + 1) \right)$$

where random numbers $r_i, r_j \in U[0, 1]$ are drawn from a uniform distribution.
$G^{\mathrm{row}}$ is the cumulative distribution function that there is an edge in a given row.
$G^{\mathrm{col}}$ is the joint cumulative distribution function that there is an edge in a given
column. The row cumulative distribution function is computed as follows

$$G^{\mathrm{row}}(i, 1) = G^{\mathrm{row}}(i - 1, 1) + g^{\mathrm{row}}(i - 1, 1)$$

where i = 2, . . . , N + 1, and $G^{\mathrm{row}}(1, 1) = 0$. Likewise, the column cumulative joint
distribution function is

$$G^{\mathrm{col}}(i, j) = G^{\mathrm{col}}(i, j - 1) + g^{\mathrm{col}}(i, j - 1)$$

where i = 1, . . . , N, j = 2, . . . , N + 1, $G^{\mathrm{col}}(i, 1) = 0$, and

$$g^{\mathrm{col}}(i, j) = \frac{G(i, j)}{\Sigma_G \, g^{\mathrm{row}}(i, 1)}$$

with $\Sigma_G$ the sum of all the entries of G.

The above formulas for $i_k(k')$ and $j_k(k')$ are repeated k times to fill in all the bits,
after which it is a simple matter to compute i and j from their bit representations. The
procedure culminates by setting the corresponding value in the adjacency matrix
A(i, j) = 1. The whole procedure is repeated for every edge to be created. Edges
can be created completely independently provided the random number generators
are seeded differently. Figure 10.1 shows an instance adjacency matrix formed by
$G^{\otimes 3}$ for N = 4 with $N_e = 8N$. Figure 10.2 shows the degree distribution for a
stochastic Kronecker graph with a million vertices and the corresponding predicted
distribution obtained from the stochastic adjacency matrix. The peak structure is
a reflection of the underlying analytic structure of the Kronecker graph (see next
section).
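
The digit-by-digit procedure above can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: it delegates the cumulative-distribution lookup to the standard library's `random.Random.choices`, uses 0-based vertex indices, and the function names and the 0/1 weight matrix are hypothetical.

```python
import random


def sample_edge(g, k, rng):
    """Draw one edge (i, j) of the k-th Kronecker power of g, digit by digit.

    At each of the k levels a row digit is drawn from the row sums of g and a
    column digit from the entries of that row; the digits are read as base-N
    numbers, so the N**k x N**k stochastic matrix is never formed.
    """
    n = len(g)
    row_weights = [sum(row) for row in g]
    i = j = 0
    for _ in range(k):
        d_i = rng.choices(range(n), weights=row_weights)[0]
        d_j = rng.choices(range(n), weights=g[d_i])[0]
        i = i * n + d_i
        j = j * n + d_j
    return i, j


def sample_instance(g, k, n_edges, seed=0):
    """Set of distinct edges from n_edges independent draws (duplicates collapse)."""
    rng = random.Random(seed)
    return {sample_edge(g, k, rng) for _ in range(n_edges)}


# Weights in the pattern of S(4) + I(4); any nonnegative matrix would do.
g = [[1, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 0, 0, 1]]
edges = sample_instance(g, 3, 512, seed=1)
```

Each draw lands on entry (i, j) with probability proportional to the corresponding entry of the Kronecker power, so zero entries are never selected. Giving each worker its own seed allows edges to be generated independently, as noted above.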

10.4 A simple bipartite model of Kronecker graphs 

Consider the model generated by taking the Kronecker product of a complete bi- 
partite graph 

G = B(n,m) 
Figure 10.2. Stochastic and instance degree distribution.

The degree distribution for a stochastic Kronecker graph with a million
vertices (circles) and the corresponding predicted distribution obtained
from the stochastic adjacency matrix (solid line). The peak structure is
a reflection of the underlying analytic structure of the Kronecker graph.



This will be a reasonable model of certain simple Kronecker graphs and will be 
a useful building block for modeling more complex Kronecker graphs. This work 
builds on the original work of Weichsel [Weichsel 1962], who first looked at the
relation between Kronecker products and graphs. 

10.4.1 Bipartite product 

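Let the adjacency matrix of a bipartite graph be denoted by B(n, m)

$$B(n, m) = \begin{pmatrix} 0 & 1^{n \times m} \\ 1^{m \times n} & 0 \end{pmatrix}$$

Likewise, if the arguments of B() are matrices A and C, then

$$B(A, C) = \begin{pmatrix} 0 & A \\ C & 0 \end{pmatrix}$$

Furthermore, let $1^{n \times m}$ denote the n × m matrix of all ones, and it is clearly the case that

$$B(n, m) = B(1^{n \times m}, 1^{m \times n})$$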

The adjacency matrix of a star graph is denoted by S(n + 1) and is related to B by

$$S(n + 1) = B(n, 1)$$

The Kronecker product of two bipartite graphs is given by

$$P_B(B(n_1, m_1) \otimes B(n_2, m_2)) = B(n_1 n_2, m_1 m_2) \cup B(n_2 m_1, n_1 m_2)$$

where $P_B = P_{B(n_1, m_1, n_2, m_2)}$ is the bipartite permutation function (see next section).
The union notation is defined as follows

$$A \cup C = \begin{pmatrix} A & 0 \\ 0 & C \end{pmatrix}$$

and has the additional form

$$\bigcup^{k} A = \begin{pmatrix} A & & 0 \\ & \ddots & \\ 0 & & A \end{pmatrix}$$

For convenience, where it is not necessary to keep track of the precise permutations,
the $\stackrel{P}{=}$ notation is used. For the bipartite product this would be written as

$$B(n_1, m_1) \otimes B(n_2, m_2) \stackrel{P}{=} B(n_1 n_2, m_1 m_2) \cup B(n_2 m_1, n_1 m_2)$$

The different representations of the Kronecker product (graphical, matrix, and
algebraic) are shown in Figure 10.3.
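
The bipartite product rule can be checked numerically without tracking the permutation, by finding the connected components of the product and comparing their degree patterns. The Python sketch below does this for B(5, 1) ⊗ B(3, 1); the helper names are ours and the check is an illustration, not a proof.

```python
from collections import Counter


def bipartite(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m)."""
    size = n + m
    return [[1 if (i < n) != (j < n) else 0 for j in range(size)]
            for i in range(size)]


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


def components(a):
    """Connected components of an undirected graph (depth-first search)."""
    seen = [False] * len(a)
    comps = []
    for start in range(len(a)):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w, edge in enumerate(a[v]):
                if edge and not seen[w]:
                    seen[w] = True
                    stack.append(w)
        comps.append(comp)
    return comps


def signature(a, comp):
    """Sorted (degree, vertex count) pairs of one component."""
    degrees = Counter(sum(a[v][w] for w in comp) for v in comp)
    return sorted(degrees.items())


product = kron(bipartite(5, 1), bipartite(3, 1))
found = sorted(signature(product, comp) for comp in components(product))
```

In B(p, q), p vertices have degree q and q vertices have degree p, so the two signatures found here correspond to B(15, 1) and B(3, 5).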

10.4.2 Bipartite Kronecker exponents 

Ignoring the required permutations for the time being, the Kronecker powers of a
bipartite graph are as follows

$$B(n, m)^{\otimes 2} \stackrel{P}{=} B(n^2, m^2) \cup B(nm, nm)$$

and

$$B(n, m)^{\otimes 3} \stackrel{P}{=} B(n^3, m^3) \cup \bigcup^{2} B(n^2 m, n m^2) \cup B(n m^2, n^2 m)$$
Figure 10.3. Graph Kronecker product.

Complementary representations of the Kronecker product of two graphs. The algebraic
form shown is $B(5, 1) \otimes B(3, 1) \stackrel{P}{=} B(15, 1) \cup B(3, 5)$.

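which generalizes to

$$B(n, m)^{\otimes k} \stackrel{P}{=} \bigcup_{r=0}^{k-1} \bigcup^{\binom{k-1}{r}} B(n^{k-r} m^r, n^r m^{k-r})$$

where

$$\binom{k}{r} = \frac{k!}{r!(k - r)!}$$

In short, the Kronecker exponential of a bipartite graph is the union of many smaller
bipartite graphs. Note: to keep the expression for the coefficients a simple binomial
coefficient, the fact that some terms can be combined further is ignored, i.e.,

$$B(n^{k-r} m^r, n^r m^{k-r}) \cup B(n^r m^{k-r}, n^{k-r} m^r) \stackrel{P}{=} \bigcup^{2} B(n^{k-r} m^r, n^r m^{k-r})$$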
The above formula clearly shows the substructures and how they grow with k. In
particular, it is worth mentioning that going from k to k + 1 results in mostly new
structures that did not exist before.

The above general expression has the following special cases

$$B(n, n)^{\otimes k} \stackrel{P}{=} \bigcup^{2^{k-1}} B(n^k, n^k)$$

and

$$B(n, 1)^{\otimes k} \stackrel{P}{=} \bigcup_{r=0}^{k-1} \bigcup^{\binom{k-1}{r}} B(n^{k-r}, n^r)$$



10.4.3 Degree distribution 

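The degree distribution of the above expression is

$$\mathrm{Deg}(B(n, m)^{\otimes k}) = \bigcup_{r=0}^{k-1} \bigcup^{\binom{k-1}{r}} \mathrm{Deg}(B(n^{k-r} m^r, n^r m^{k-r}))$$

where

$$\mathrm{Deg}(B(n, m)) = \{ \ldots m \ldots, \ldots n \ldots \}$$

and so

$$\mathrm{Deg}(B(n^{k-r} m^r, n^r m^{k-r})) = \{ \ldots n^r m^{k-r} \ldots, \ldots n^{k-r} m^r \ldots \}$$

The histogram of the degree distribution of the right set can be computed as follows

$$\mathrm{Count}[\mathrm{Deg} = n^{k-r} m^r] = \binom{k-1}{r} n^r m^{k-r}$$

Substituting $\bar{r} = k - r$ gives

$$\mathrm{Count}[\mathrm{Deg} = n^{\bar{r}} m^{k-\bar{r}}] = \binom{k-1}{k-\bar{r}} n^{k-\bar{r}} m^{\bar{r}}$$

where $\bar{r} = 1, \ldots, k$. The histogram of the degree distribution of the left set can be
computed as follows

$$\mathrm{Count}[\mathrm{Deg} = n^r m^{k-r}] = \binom{k-1}{r} n^{k-r} m^r$$

Substituting $\bar{r} = r$ gives

$$\mathrm{Count}[\mathrm{Deg} = n^{\bar{r}} m^{k-\bar{r}}] = \binom{k-1}{\bar{r}} n^{k-\bar{r}} m^{\bar{r}}$$

where $\bar{r} = 0, \ldots, k - 1$. Combining the formulas for the left and right sets and
setting $\bar{r} \to r$ produce

$$\mathrm{Count}[\mathrm{Deg} = n^r m^{k-r}] = \left[ \binom{k-1}{k-r}\bigg|_{r=1,\ldots,k} + \binom{k-1}{r}\bigg|_{r=0,\ldots,k-1} \right] n^{k-r} m^r$$

Finally, we can simplify the binomial coefficients to obtain the following elegant
expression

$$\mathrm{Count}[\mathrm{Deg} = n^r m^{k-r}] = \binom{k}{r} n^{k-r} m^r$$

where r = 0, . . . , k.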
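
The closed form can be compared against an explicitly constructed Kronecker power. The Python sketch below builds B(3, 2) to the third power (125 vertices) and tabulates its degrees; the values n = 3, m = 2, k = 3 are chosen by us so that the degrees $n^r m^{k-r}$ are distinct for different r.

```python
from collections import Counter
from math import comb


def bipartite(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m)."""
    size = n + m
    return [[1 if (i < n) != (j < n) else 0 for j in range(size)]
            for i in range(size)]


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


n, m, k = 3, 2, 3
power = bipartite(n, m)
for _ in range(k - 1):
    power = kron(power, bipartite(n, m))

# Observed histogram: degree -> number of vertices with that degree.
observed = Counter(sum(row) for row in power)

# Predicted histogram: Count[Deg = n^r m^(k-r)] = C(k, r) n^(k-r) m^r.
predicted = {n**r * m**(k - r): comb(k, r) * n**(k - r) * m**r
             for r in range(k + 1)}
```

The predicted counts sum to $(n + m)^k$, the total number of vertices, by the binomial theorem.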

For the case where B = B(n, n), this yields the trivial distribution of all
vertices having the exact same degree (which is clearly not a power law). For the
more interesting case where B = B(n, 1), the distribution is

$$\mathrm{Count}[\mathrm{Deg} = n^r] = \binom{k}{r} n^{k-r}$$

Under the assumption n > m, the above degree distribution is ordered in r
and the slope of this degree distribution can be readily computed

$$\mathrm{Slope}[n^r m^{k-r} \to n^{r+1} m^{k-r-1}] = -1 + \frac{\log_n[(k - r)/(r + 1)]}{\log_n[n/m]}$$

which for the first and last points gives

$$\mathrm{Slope}[n^0 m^k \to n^1 m^{k-1}] = -1 + \log_n(k) / \log_n(n/m)$$

$$\mathrm{Slope}[n^{k-1} m^1 \to n^k m^0] = -1 - \log_n(k) / \log_n(n/m)$$

More interestingly, this distribution has the property that over any symmetric
interval $r \to k - r$

$$\mathrm{Slope}[n^r m^{k-r} \to n^{k-r} m^r] = -1$$

which is why it can be closely associated with a power law graph (see Figure 10.4).
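
The slope statements can be checked directly from the closed-form histogram points (degree $n^r m^{k-r}$, count $\binom{k}{r} n^{k-r} m^r$). The following Python sketch uses hypothetical values n = 4, m = 1, k = 10 and our own helper names.

```python
from math import comb, log


def point(n, m, k, r):
    """(degree, count) of the r-th point of the B(n, m) degree histogram."""
    return n**r * m**(k - r), comb(k, r) * n**(k - r) * m**r


def slope(n, m, k, r1, r2):
    """Log-log slope of the histogram between points r1 and r2."""
    d1, c1 = point(n, m, k, r1)
    d2, c2 = point(n, m, k, r2)
    return (log(c2) - log(c1)) / (log(d2) - log(d1))


n, m, k = 4, 1, 10
# Symmetric intervals r -> k - r (r < k / 2 so the two points differ).
symmetric = [slope(n, m, k, r, k - r) for r in range(k // 2)]
first = slope(n, m, k, 0, 1)
last = slope(n, m, k, k - 1, k)
```

The symmetric slopes all equal −1 because the binomial coefficients at r and k − r cancel, leaving a count ratio that is the reciprocal of the degree ratio.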

10.4.4 Betweenness centrality 

The same technique can be applied to any other statistic of interest. For example, 
consider the betweenness centrality metric that is frequently used in graph analysis 
[Freeman 1977, Brandes 2001, Bader 2006]. This metric is defined as 

$$C_b(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
Figure 10.4. Theoretical degree distribution.

Degree distribution derived from the Kronecker product of bipartite
graphs. The slope over any symmetric interval is always −1. Horizontal axis: $\log_n$(vertex degree).



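where $\sigma_{st}$ is the number of shortest paths between vertex s and vertex t, and $\sigma_{st}(v)$
is the number of shortest paths between vertex s and vertex t that pass through
vertex v. Applying $C_b$ leads to

$$C_b(B(n, m)^{\otimes k}) = \bigcup_{r=0}^{k-1} \bigcup^{\binom{k-1}{r}} C_b(B(n^{k-r} m^r, n^r m^{k-r}))$$

where

$$C_b(B(n, m)) = \{ \overbrace{\ldots m(m-1)/n \ldots}^{n}, \overbrace{\ldots n(n-1)/m \ldots}^{m} \}$$

for n, m > 1. For B(n, 1), this becomes

$$C_b(B(n, 1)) = \{ \ldots 0 \ldots, n(n-1) \}$$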
In the general case, the above can be written as

$$C_b(B(n^{k-r} m^r, n^r m^{k-r})) = \left\{ \ldots \frac{n^r m^{k-r}(n^r m^{k-r} - 1)}{n^{k-r} m^r} \ldots, \ldots \frac{n^{k-r} m^r (n^{k-r} m^r - 1)}{n^r m^{k-r}} \ldots \right\}$$

$$= \{ \ldots (m/n)^{k-2r}(n^r m^{k-r} - 1) \ldots, \ldots (n/m)^{k-2r}(n^{k-r} m^r - 1) \ldots \}$$

The histogram of the betweenness centrality distribution of the right-hand set can
be computed as follows

$$\mathrm{Count}[C_b = (n/m)^{k-2r}(n^{k-r} m^r - 1)] = \binom{k-1}{r} n^r m^{k-r}$$

Likewise, the betweenness centrality distribution of the left-hand set is

$$\mathrm{Count}[C_b = (n/m)^{2r-k}(n^r m^{k-r} - 1)] = \binom{k-1}{r} n^{k-r} m^r$$

Combining the above two formulas by using the same steps that were used in the
previous section to compute the degree distributions results in the following

$$\mathrm{Count}[C_b = (n/m)^{2r-k}(n^r m^{k-r} - 1)] = \binom{k}{r} n^{k-r} m^r$$

where r = 0, . . . , k.
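
The betweenness histogram can be checked by brute force on a small case. The Python sketch below counts shortest paths with a breadth-first search from every vertex and sums $\sigma_{st}(v)/\sigma_{st}$ over ordered pairs (s, t), which is the convention that reproduces m(m − 1)/n for B(n, m). The values n = 3, m = 2, k = 2 and the helper names are ours; exact fractions avoid rounding issues.

```python
from collections import Counter, deque
from fractions import Fraction
from math import comb


def bipartite(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m)."""
    size = n + m
    return [[1 if (i < n) != (j < n) else 0 for j in range(size)]
            for i in range(size)]


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


def shortest_paths(a, s):
    """Distances and shortest-path counts from s (breadth-first search)."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w, edge in enumerate(a[v]):
            if not edge:
                continue
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return dist, sigma


def betweenness(a):
    """C_b(v) summed over ordered pairs s != v != t in the same component."""
    info = [shortest_paths(a, s) for s in range(len(a))]
    cb = [Fraction(0)] * len(a)
    for s, (dist_s, sigma_s) in enumerate(info):
        for t in dist_s:
            if t == s:
                continue
            for v in dist_s:
                if v == s or v == t:
                    continue
                dist_v, sigma_v = info[v]
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    cb[v] += Fraction(sigma_s[v] * sigma_v[t], sigma_s[t])
    return cb


n, m, k = 3, 2, 2
power = kron(bipartite(n, m), bipartite(n, m))
observed = Counter(betweenness(power))
predicted = {Fraction(n, m)**(2 * r - k) * (n**r * m**(k - r) - 1):
             comb(k, r) * n**(k - r) * m**r for r in range(k + 1)}
```

The brute-force loop is cubic in the number of vertices, so it is suitable only for small checks of this kind.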

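Selected values of the histogram for the degree distribution and the betweenness
centrality of $B(n, 1)^{\otimes k}$ are as follows

| Deg | $C_b$ | Count |
|---|---|---|
| $1$ | $0$ | $n^k$ |
| $n$ | $n^{2-k}(n - 1)$ | $k n^{k-1}$ |
| $n^{k-1}$ | $n^{k-2}(n^{k-1} - 1)$ | $kn$ |
| $n^k$ | $n^k(n^k - 1)$ | $1$ |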



10.4.5 Graph diameter and eigenvalues 

The diameter of a graph Diam() is the maximum of the shortest path between any
two vertices (assuming all vertices are reachable). Clearly

$$\mathrm{Diam}(B) = 2$$

Likewise

$$\mathrm{Diam}(A \cup C) = \infty$$

since there are unreachable nodes. By this result, the diameter of the Kronecker
exponential of a bipartite graph is

$$\mathrm{Diam}(B(n, m)^{\otimes k}) = \infty$$

A bipartite graph has just two nonzero eigenvalues

$$\mathrm{eig}(B) = \{ (nm)^{1/2}, -(nm)^{1/2} \}$$

The eigenvalues of $B^{\otimes k}$ follow directly from Theorem 2 of [Leskovec 2005], which states
that the eigenvalues of a Kronecker product are just the Kronecker product of the
eigenvalues. Thus, $B^{\otimes k}$ has $2^k$ nonzero eigenvalues

$$\mathrm{eig}(B(n, m)^{\otimes k}) = \{ (nm)^{k/2}, \ldots, (nm)^{k/2}, -(nm)^{k/2}, \ldots, -(nm)^{k/2} \}$$

10.4.6 Iso-parametric ratio 

Another property of interest is the "surface" to "volume" ratio of subsets of the 
graph [Chung 2005], which can be measured using various iso-parametric ratios. 
One such ratio is given by 

where X is a subset of vertices with set complement $\bar{X} = V \setminus X$, and A is the
adjacency matrix of the graph. For an unweighted symmetric graph, this can also 
be written as 

Applying this to the graph $B(n, m)^{\otimes k}$ results in a few interesting cases.

Consider the case when the subset consists of only those vertices in $n_k(i)$ (or
$m_k(i)$)

$$\mathrm{IsoPar}(n_k(i)) = \infty$$

In other words, since $n_k(i)$ is part of a complete bipartite subgraph, there are no
edges between the vertices in $n_k(i)$. In general, this will also be true of a small
random subset of vertices (i.e., most vertices are not connected to each other).

Next, consider the case when the subset consists of all the vertices in the
subgraph $B(n_k(i), m_k(i))$:

$$\mathrm{IsoPar}(n_k(i) \cup m_k(i)) = 0$$

In other words, there are no connections to any other vertices outside this subgraph.
Likewise, for any random subgraphs $B(n_k(i), m_k(i))$ and $B(n_k(i'), m_k(i'))$

$$\mathrm{IsoPar}(n_k(i) \cup m_k(i) \cup n_k(i') \cup m_k(i')) = 0$$
10.5 Kronecker products and useful permutations

This section gives additional basic results on Kronecker products that will be useful
in exploring more complex Kronecker graphs.

Other Kronecker product identities are as follows. I(N) is the N × N identity
matrix. The Kronecker product of the identity with itself is

$$I(N_1) \otimes I(N_2) = I(N_1 N_2)$$

The Kronecker product of I with another matrix C is

$$I(N) \otimes C = \bigcup^{N} C$$

Let 1(N) denote the N × N matrix with all values set equal to 1. The Kronecker
product of 1 with itself is

$$1(N_1) \otimes 1(N_2) = 1(N_1 N_2)$$



10.5.1 Sparsity 



The sparsity denotes the fraction of the entries in the matrix that are nonzeros. For 
the identity matrix the sparsity is 

$$\sigma(I) = \sigma_I = \frac{N}{N^2} = N^{-1}$$

Likewise, for a bipartite matrix

$$\sigma(B(n, m)) = \sigma_B = \frac{2nm}{(n + m)^2}$$

The sparsity is related to the Kronecker product as follows

$$\sigma(A) = \sigma(B)\sigma(C)$$

where $A = B \otimes C$. The sparsity is related to the Kronecker exponent as follows

$$\sigma(G^{\otimes k}) = \sigma(G)^k$$
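
These sparsity relations are easy to confirm on small matrices. The following Python sketch counts nonzeros exactly with fractions; the matrices chosen (B(3, 2) and I(2)) and the helper names are ours.

```python
from fractions import Fraction


def bipartite(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m)."""
    size = n + m
    return [[1 if (i < n) != (j < n) else 0 for j in range(size)]
            for i in range(size)]


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


def sparsity(a):
    """Fraction of the entries of a that are nonzero."""
    nonzeros = sum(1 for row in a for x in row if x != 0)
    return Fraction(nonzeros, len(a) * len(a[0]))


b = bipartite(3, 2)
identity = [[1, 0], [0, 1]]

product_sparsity = sparsity(kron(b, identity))
cube_sparsity = sparsity(kron(kron(b, b), b))
```

The product rule holds because an entry of the Kronecker product is nonzero exactly when both of its factors are nonzero.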

10.5.2 Permutations 

The permutation function P will be used in the following overloaded manner (always 
in the context of square matrices) 

$i = P^{-1}(i')$,
P(A) permutes A,

$P A P^{T}$ means that P is a permutation matrix.
P also has the identities P(I) = I, P(1) = 1, and P(B + C) = P(B) + P(C).
10.5.3 Pop permutation

If $B \in \mathbb{R}^{N_B \times N_B}$, $C \in \mathbb{R}^{N_C \times N_C}$, and

$$A = B \otimes C$$

then there exists a permutation with

$$P_A(A) = P_B(B) \otimes P_C(C)$$

where $P_B$ and $P_C$ are arbitrary permutations of B and C. Furthermore, $P_A$ can be
computed from $P_B$ and $P_C$ as follows

$$P_A(i_A) = (P_B(\lfloor (i_A - 1)/N_C \rfloor + 1) - 1) N_C + P_C(((i_A - 1) \bmod N_C) + 1)$$

$$= (P_B(i_B(i_A)) - 1) N_C + P_C(i_C(i_A))$$

where

$$i_B(i_A) = \lfloor (i_A - 1)/N_C \rfloor + 1, \qquad i_C(i_A) = ((i_A - 1) \bmod N_C) + 1$$

In general, $P_A$ in the above context will be denoted as

$$P_{\mathrm{pop}(P_B, P_C)}(B \otimes C) = P_B(B) \otimes P_C(C)$$

because it allows "popping" the permutations outside the Kronecker product.
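
A small numerical check of the pop permutation is sketched below in Python. It uses 0-based indices, so the block index is `i // n_c` and the within-block index is `i % n_c`; the convention that `permute` moves entry (i, j) to (p[i], p[j]) is our assumption, and the identity holds as long as the same convention is used on both sides. The matrices and permutations are random test data.

```python
import random


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


def permute(a, p):
    """Relabel vertices: entry (i, j) of a moves to position (p[i], p[j])."""
    size = len(a)
    out = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            out[p[i]][p[j]] = a[i][j]
    return out


def pop_permutation(p_b, p_c):
    """Permutation of B (x) C induced by p_b on B and p_c on C (0-based)."""
    n_c = len(p_c)
    return [p_b[i // n_c] * n_c + p_c[i % n_c]
            for i in range(len(p_b) * n_c)]


rng = random.Random(0)
b = [[rng.randint(0, 9) for _ in range(3)] for _ in range(3)]
c = [[rng.randint(0, 9) for _ in range(4)] for _ in range(4)]
p_b = list(range(3))
p_c = list(range(4))
rng.shuffle(p_b)
rng.shuffle(p_c)

p_a = pop_permutation(p_b, p_c)
lhs = permute(kron(b, c), p_a)                   # P_A(B (x) C)
rhs = kron(permute(b, p_b), permute(c, p_c))     # P_B(B) (x) P_C(C)
```

Here `lhs` and `rhs` agree entry by entry, which is the "popping" property described above.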

10.5.4 Bipartite permutation 

The bipartite permutation $P_B$ is the key permutation that allows reorganizing the
Kronecker product of two bipartite graphs into separate disconnected graphs. The
formula for $P_B = P_{B(n_1, m_1, n_2, m_2)}$ is

| i | $P_B(i)$ | $i_1$ | $i_2$ |
|---|---|---|---|
| $N_2 i_1 + i_2$ | $n_2 i_1 + i_2$ | $0, \ldots, n_1 - 1$ | $1, \ldots, n_2$ |
| $N_2 i_1 + n_2 + i_2$ | $(n_1 n_2 + m_1 m_2) + n_2 m_1 + m_2 i_1 + i_2$ | $0, \ldots, n_1 - 1$ | $1, \ldots, m_2$ |
| $N_2 (n_1 + i_1) + i_2$ | $(n_1 n_2 + m_1 m_2) + n_2 i_1 + i_2$ | $0, \ldots, m_1 - 1$ | $1, \ldots, n_2$ |
| $N_2 (n_1 + i_1) + n_2 + i_2$ | $n_1 n_2 + m_2 i_1 + i_2$ | $0, \ldots, m_1 - 1$ | $1, \ldots, m_2$ |

where $N_1 = n_1 + m_1$ and $N_2 = n_2 + m_2$.
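
A bipartite permutation of this kind can be sketched and checked in Python as follows. The code is 0-based and makes one assumption explicit: within the second component, the vertices pairing the left set of the first graph with the right set of the second graph are placed after the $n_2 m_1$ vertices of that component's left set, so that the four cases together form a bijection. The function names are ours.

```python
def bipartite(n, m):
    """Adjacency matrix of the complete bipartite graph B(n, m)."""
    size = n + m
    return [[1 if (i < n) != (j < n) else 0 for j in range(size)]
            for i in range(size)]


def kron(b, c):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in b_row for y in c_row]
            for b_row in b for c_row in c]


def permute(a, p):
    """Relabel vertices: entry (i, j) of a moves to position (p[i], p[j])."""
    out = [[0] * len(a) for _ in range(len(a))]
    for i in range(len(a)):
        for j in range(len(a)):
            out[p[i]][p[j]] = a[i][j]
    return out


def union(a, c):
    """Block-diagonal union of two adjacency matrices."""
    return ([row + [0] * len(c) for row in a] +
            [[0] * len(a) + row for row in c])


def bipartite_permutation(n1, m1, n2, m2):
    """0-based target position of every vertex of B(n1, m1) (x) B(n2, m2)."""
    first = n1 * n2 + m1 * m2            # vertex count of the first component
    p = []
    for a in range(n1 + m1):             # index in the first graph
        for b in range(n2 + m2):         # index in the second graph
            if a < n1 and b < n2:        # left x left: left set of B(n1 n2, m1 m2)
                p.append(n2 * a + b)
            elif a < n1:                 # left x right: right set of B(n2 m1, n1 m2)
                p.append(first + n2 * m1 + m2 * a + (b - n2))
            elif b < n2:                 # right x left: left set of B(n2 m1, n1 m2)
                p.append(first + n2 * (a - n1) + b)
            else:                        # right x right: right set of B(n1 n2, m1 m2)
                p.append(n1 * n2 + m2 * (a - n1) + (b - n2))
    return p


n1, m1, n2, m2 = 5, 1, 3, 1
p = bipartite_permutation(n1, m1, n2, m2)
permuted = permute(kron(bipartite(n1, m1), bipartite(n2, m2)), p)
expected = union(bipartite(n1 * n2, m1 * m2), bipartite(n2 * m1, n1 * m2))
```

With these values the permuted product is the block-diagonal union of B(15, 1) and B(3, 5).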

1 0.5.5 Recursive bipartite permutation 

The recursive bipartite permutation allows the precise construction of $B^{\otimes k}$ as a
union of $2^{k-1}$ bipartite graphs

$$P_{\bar{k}}(B(n,m)^{\otimes k}) = \bigcup_{i=1}^{2^{k-1}} B(n_k(i), m_k(i))$$

where $n_k(i)$ and $m_k(i)$ are defined by the recursive relations

$$n_k(2i-1) = n_{k-1}(i)\, n$$
$$n_k(2i) = m_{k-1}(i)\, n$$
$$m_k(2i-1) = m_{k-1}(i)\, m$$
$$m_k(2i) = n_{k-1}(i)\, m$$

with

$$n_1(1) = n \quad \text{and} \quad m_1(1) = m$$
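The recursion can be unrolled directly. The sketch below (function name `block_sizes` is an illustrative choice) lists the pairs $(n_k(i), m_k(i))$ and provides two sanity checks that follow from the recursion itself: the block sizes add up to $(n+m)^k$ vertices, and the blocks hold $(2nm)^k$ nonzeros in total.

```python
def block_sizes(n, m, k):
    """Return [(n_k(i), m_k(i)) for i = 1 .. 2**(k-1)] for B(n, m)^(x)k."""
    sizes = [(n, m)]                                # n_1(1) = n, m_1(1) = m
    for _ in range(k - 1):
        nxt = []
        for n_prev, m_prev in sizes:
            nxt.append((n_prev * n, m_prev * m))    # child 2i - 1
            nxt.append((m_prev * n, n_prev * m))    # child 2i
        sizes = nxt
    return sizes


sizes = block_sizes(4, 1, 3)
vertices = sum(a + b for a, b in sizes)       # expected (4 + 1) ** 3
nonzeros = sum(2 * a * b for a, b in sizes)   # expected (2 * 4 * 1) ** 3
```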

The permutation function $P_{\bar{k}}$ can be computed as follows. Let

$$P_{\bar{k}}(B(n,m)^{\otimes k}) = P_k(P_{\overline{k-1}}(B(n,m)^{\otimes k-1}) \otimes B(n,m))$$

where the above permutations are applied recursively until

$$P_{\bar{2}}(B(n,m)^{\otimes 2}) = B(n^2, m^2) \cup B(nm, nm)$$

which implies

$$P_{\bar{2}} = P_2 = P_{B(n,m,n,m)}$$

In other words, $P_{\bar{k}}$ is the recursive composition of each of the constituent
permutations $P_k$. At level k - 1, there are the following terms

$$P_{\overline{k-1}}(B(n,m)^{\otimes k-1}) = \bigcup_{i=1}^{2^{k-2}} B(n_{k-1}(i), m_{k-1}(i))$$

Multiplying by B(n, m) to get $B(n,m)^{\otimes k}$ implies each of the above terms will be
permuted by the bipartite permutation function

$$P_{k,i} = P_{B(n_{k-1}(i), m_{k-1}(i), n, m)}$$

These functions are then concatenated together with the union function to produce
all of $P_k$

$$P_k = \bigcup_{i=1}^{2^{k-2}} P_{k,i}$$

The aggregate permutation $P_{\bar{k}}$ is then computed using $P_{pop}$

$$P_{\bar{k}}(i_k) = P_k\big((P_{\overline{k-1}}(\lfloor (i_k - 1)/N \rfloor + 1) - 1) N + ((i_k - 1) \bmod N) + 1\big)$$

The effect of this permutation can be seen in Figure 10.5.
Figure 10.5. Recursive bipartite permutation.

The top figure shows the unpermuted $(B + I)^{\otimes k}$ for n = 4, m = 1,
and k = 4. The bottom figure shows the same adjacency matrix after
applying the recursive bipartite permutation $P_{\bar{k}}$.
10.5.6 Bipartite index tree

The functions $n_k(i)$ and $m_k(i)$ form a tree which defines the overall structure of
$B(n,m)^{\otimes k}$. These values are powers of n and m. It is possible to define complex
exponents $\nu$ and $\mu$ such that

$$n_k(i) = n^{Re(\nu_k(i))}\, m^{Im(\nu_k(i))}$$

$$m_k(i) = n^{Re(\mu_k(i))}\, m^{Im(\mu_k(i))}$$

and the recursion relations become

$$\nu_k(2i-1) = \nu_{k-1}(i) + 1$$
$$\nu_k(2i) = \mu_{k-1}(i) + 1$$
$$\mu_k(2i-1) = \mu_{k-1}(i) + \sqrt{-1}$$
$$\mu_k(2i) = \nu_{k-1}(i) + \sqrt{-1}$$

where

$$\nu_1(1) = 1 \quad \text{and} \quad \mu_1(1) = \sqrt{-1}$$

Observe that

$$Re(\nu) = Im(\mu) \quad \text{and} \quad Im(\nu) = Re(\mu)$$

and

$$Re(\nu) = k - Im(\nu) \quad \text{and} \quad Re(\mu) = k - Im(\mu)$$

Thus the entire structure can be determined by the single sequence $Re(\nu_k(i))$. $Re(\nu_k)$
in a stacked form is

1
2 1
3 1 2 2
4 1 2 3 3 2 3 2
5 1 2 4 3 3 4 2 4 2 3 3 4 2 3 3
6 1 2 5 3 4 5 2 4 3 4 3 5 2 3 4 5 2 3 4 4 3 4 3 5 2 3 4 4 3 4 3

Likewise, $Re(\nu_k)$ in a tree depiction is

                               1
                              2 1
                            3 1 2 2
                        4 1 2 3 3 2 3 2
                5 1 2 4 3 3 4 2 4 2 3 3 4 2 3 3
6 1 2 5 3 4 5 2 4 3 4 3 5 2 3 4 5 2 3 4 4 3 4 3 5 2 3 4 4 3 4 3

which reveals

$$Re(\nu_k(i)) + Re(\nu_k(i+1)) = k + 1, \quad i = 1, 3, \ldots$$

In addition, it is apparent that the higher/lower values (e.g., 6 and 1) are more
closely associated with lower values of i while the more intermediate values (e.g., 4
and 3) become more prevalent as i increases.
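The index tree is easy to regenerate by carrying the complex exponents through the recursion. The sketch below uses Python's built-in complex numbers, with the real part counting factors of n and the imaginary part counting factors of m; the function name `re_nu` is an illustrative choice.

```python
def re_nu(k):
    """Return [Re(nu_k(i)) for i = 1 .. 2**(k-1)]."""
    nu, mu = [1 + 0j], [1j]              # nu_1(1) = 1, mu_1(1) = sqrt(-1)
    for _ in range(k - 1):
        nu_next, mu_next = [], []
        for v, u in zip(nu, mu):
            nu_next += [v + 1, u + 1]    # nu_k(2i - 1), nu_k(2i)
            mu_next += [u + 1j, v + 1j]  # mu_k(2i - 1), mu_k(2i)
        nu, mu = nu_next, mu_next
    return [int(v.real) for v in nu]


rows = [re_nu(k) for k in range(1, 7)]   # one row of the tree per level
pair_sums = [a + b for a, b in zip(rows[5][0::2], rows[5][1::2])]
```

For level k = 6, every entry of `pair_sums` equals k + 1 = 7, which is the pairing relation stated above.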
10.6 A more general model of Kronecker graphs

A useful generating graph that can be used to model a wide range of power law
graphs has the following form

$$G(n+1) = \beta S(n+1) + \alpha I(n+1) + \gamma \mathbf{1}(n+1)$$

Noting that S(n + 1) = B(n, 1) allows the more generalized model

$$G(n,m) = \beta B(n,m) + \alpha I(n+m) + \gamma \mathbf{1}(n+m)$$

In addition to the aforementioned specialization, several special cases are also of
interest. The trivial cases are

$$\alpha > 0, \quad \beta = 0, \quad \gamma = 0$$

which implies

$$G(n,m)^{\otimes k} = \alpha^k I((n+m)^k) = \alpha^k I(N^k)$$

and

$$\alpha = 0, \quad \beta = 0, \quad \gamma > 0$$

which implies

$$G(n,m)^{\otimes k} = \gamma^k \mathbf{1}((n+m)^k) = \gamma^k \mathbf{1}(N^k)$$

The less trivial case of

$$\alpha = 0, \quad \beta > 0, \quad \gamma = 0$$

has already been worked out in the previous section

$$G(n,m)^{\otimes k} = \beta^k B(n,m)^{\otimes k}$$

Finally, the case where

$$\alpha = \gamma > 0, \quad \beta = 0$$

deals with the expression

$$G(n,m)^{\otimes k} = \alpha^k (I(n+m) + \mathbf{1}(n+m))^{\otimes k}$$

In general, the most interesting case is where

$$1 > \beta > \alpha > \gamma > 0$$

This model corresponds to a foreground bipartite plus identity graph with a
background probability of any vertex connecting to any other vertex

$$G(n,m) = \beta B(n,m) + \alpha I(n+m) + \gamma \mathbf{1}(n+m) = \beta B + \alpha I + \gamma \mathbf{1}$$

However, because $\gamma \ll \alpha$, the situation can be approximated as follows

$$G(n,m)^{\otimes k} \approx (\beta B + \alpha I)^{\otimes k} + \gamma\, "\tbinom{k}{1}"\, \mathbf{1}(N) \otimes (\beta B + \alpha I)^{\otimes k-1} + \text{lower order terms}$$
$$\approx (\beta B + \alpha I)^{\otimes k} + \gamma\, "k"\, \mathbf{1}(N) \otimes (\beta B + \alpha I)^{\otimes k-1}$$
For further theoretical analysis, consider the explicit form with the understanding
that the appropriate $\alpha$, $\beta$, and $\gamma$ terms can be reinstated when needed for a
stochastic representation. Under these conditions, the explicit form for the above formula
is simply

$$G(n,m)^{\otimes k} \approx (B + I)^{\otimes k} + "k"\, \mathbf{1}(N) \otimes (B + I)^{\otimes k-1}$$



10.6.1 Sparsity analysis 

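The structure of $(B + I)^{\otimes k}$ is given by

$$(B + I)^{\otimes k} = \sum_{r=0}^{k} "\tbinom{k}{r}"\, B^{\otimes k-r} \otimes I^{\otimes r}$$

where " " denotes that permutations are required to actually combine these terms.
The sparsity structure of the above term is

$$\sigma((B + I)^{\otimes k}) = \sum_{r=0}^{k} \binom{k}{r} \sigma_B^{k-r} \sigma_I^{r} = (\sigma_B + \sigma_I)^k$$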

Excluding the background (7I), the sparsity contribution of the rth term is 



Computing the sparsity of the equations of previous sections for the case where
$\beta\sigma_B > \alpha\sigma_I$ gives



fc-i 

B 



Applying this approximation back to the above equations gives

$$G(n,m)^{\otimes k} \approx \beta^k B^{\otimes k} + \alpha\beta^{k-1}\, "k"\, I \otimes B^{\otimes k-1} + \gamma\beta^{k-1}\, "k"\, \mathbf{1} \otimes B^{\otimes k-1}$$
$$\approx \beta^k B^{\otimes k} + "k"\, (\alpha I + \gamma \mathbf{1}) \otimes (\beta B)^{\otimes k-1}$$

Excluding the background, the explicit form is

$$G(n,m)^{\otimes k} \approx B^{\otimes k} + "k"\, B^{\otimes k-1} \otimes I$$

The relative contributions to the overall sparsity can be estimated as follows. In
the case where B = B(n, 1), we have

$$\left(\frac{\sigma_B}{\sigma_I + \sigma_B}\right)^k = \left(\frac{(2N-2)/N^2}{(3N-2)/N^2}\right)^k \approx \left(\frac{2}{3}\right)^k$$

In the case where B = B(n, n), we have

$$\left(\frac{\sigma_B}{\sigma_I + \sigma_B}\right)^k = \left(\frac{(N^2/2)/N^2}{(N + N^2/2)/N^2}\right)^k = \left(\frac{N}{N + 2}\right)^k$$

In either case, as k increases, the $B^{\otimes k}$ term will go from dominating the overall sparsity
to being a lesser component. That said, this term will always be the dominant single
term and dominates the structure of the overall graph since more terms simply add
a more "diffuse" component (see next section).

10.6.2 Second order terms 

We can make the above expression more precise by inserting the full expression for
the second term

$$\sum_{l=0}^{k-1} B^{\otimes k-1-l} \otimes I \otimes B^{\otimes l}$$

which removes the arbitrary permutation giving

$$G(n,m)^{\otimes k} \approx B^{\otimes k} + \sum_{l=0}^{k-1} B^{\otimes k-1-l} \otimes I \otimes B^{\otimes l}$$

Applying the recursive bipartite permutation produces

$$P_{\bar{k}}(G(n,m)^{\otimes k}) \approx P_{\bar{k}}(B^{\otimes k}) + \sum_{l=0}^{k-1} P_{\bar{k}}(B^{\otimes k-1-l} \otimes I \otimes B^{\otimes l})$$

The first term has already been discussed and it describes a sequence of disjoint
bipartite graphs $B(n_k(i), m_k(i))$. The second term describes the connections between
these bipartite graphs (see Figure 10.6).

The second order terms can be broken into two components 



Pj;(B^'=-i-' (g) I (g) B®') = xf ^ XM: 



where $\chi_l^k$ is a $2^k \times 2^k$ adjacency matrix that specifies the connections between the
blocks in the bipartite graph (see Figure 10.7). For convenience, let

$$\bar{n}_k = \{n_k(1)\ m_k(1)\ \ldots\ n_k(2^{k-1})\ m_k(2^{k-1})\}$$

Then $\chi_l^k(i_1, i_2) = 1$ means that there is a connection between the block of vertices
specified by $\bar{n}_k(i_1)$ and the block of vertices specified by $\bar{n}_k(i_2)$. $\chi_l^k$ is computed by
simply setting n = 1 and m = 1, which gives

$$\chi_l^k = P_{\bar{k}}(B(1,1)^{\otimes k-1-l} \otimes I(2) \otimes B(1,1)^{\otimes l})$$

Furthermore, since each row or column of $\chi_l^k$ has only one value, the connections
between blocks can be expressed as follows

$$i_2 = i_1 + \Delta_l^k(i_1)$$
[Figure 10.6 panels: $(B+I)^{\otimes 3}$; 1st+2nd Order; Higher Order; $P_3(B^{\otimes 3})$; $P_3(B^{\otimes 3} + B \otimes I \otimes B)$; $P_3(B^{\otimes 3} + B \otimes B \otimes I)$; $P_3(B^{\otimes 3} + I \otimes B \otimes B)$]

Figure 10.6. The revealed structure of $(B + I)^{\otimes 3}$.

Upper part shows that a large fraction of the nonzero elements are from
the first and second order terms. Lower part shows each of the second
order terms after the recursive bipartite permutation has been applied.
Figure 10.7. Structure of (B + 1)*^^ and corresponding xf- 



These values are plotted in Figure 10.8 and show that the k blocks each block is
connected to effectively span the set of $2^k$ blocks.

The strength of the connection between blocks of vertices is given by $X_{i_1 i_2}$,
which is an $\bar{n}_k(i_1) \times \bar{n}_k(i_2)$ matrix with the following structure

$$X_{i_1 i_2} = \mathbf{1}(\bar{n}_k(i_1)/n,\ \bar{n}_k(i_2)/n) \otimes I(n)$$

or

$$X_{i_1 i_2} = \mathbf{1}(\bar{n}_k(i_1)/m,\ \bar{n}_k(i_2)/m) \otimes I(m)$$

Figure 10.8. Block connections $\Delta_l^k(i)$.

Thus, although each block of vertices will be connected to k other blocks, the
strength of these connections is weaker by a factor of n or m compared to the
connections within each bipartite block.



10.6.3 Higher order terms 

A similar analysis can be applied to the higher order terms. The overall effect is
that the higher order terms are more numerous and more diffuse. In this respect,
these higher order terms can be equated to a weakly structured background and
obviate the need for the $\gamma$ term.

The order with the largest contribution $\hat{r}$ is the solution to



where




If $\phi = 1$, then the maximum is just the peak of the binomial coefficient $\hat{r} = k/2$. In
general, $\phi > 0$ and so $\hat{r} < k/2$.
10.6.4 Degree distribution

To compute the degree distribution of the second order terms of a given vertex v
requires first finding its corresponding block of vertices $\bar{n}_k(i)$ by using the bipartite
index tree. Then all the contributions from the different terms are summed

1=0 

Each of the above terms will tend to increase the number of unique values in the
overall degree distribution. However, when summed with the other first order terms,
they combine coherently. The result is that the degree distribution of the second
and higher order terms is a simple horizontal offset from the degree distribution of
the first order terms

$$n^r m^{k-r} \to n^r m^{k-r}[1 + r/n + (k-r)/m]$$

or

$$Count[Deg = n^r m^{k-r}[1 + r/n + (k-r)/m]] = \binom{k}{r} n^{k-r} m^{r}$$

where it is assumed n > m. Interestingly, the above expression can also be written
in terms of the partial derivatives of n and m

$$[1 + \partial_n + \partial_m]\, n^r m^{k-r} = n^r m^{k-r}[1 + r/n + (k-r)/m]$$

For the case where m = 1, the above expression simplifies to

$$n^r \to n^r[1 + (k-r) + r/n]$$

Perhaps even more interesting is that the higher order terms also coherently sum
in a similar fashion. The degree distribution of $(B + I)^{\otimes k}$ is simply

$$Count[Deg = (n+1)^r (m+1)^{k-r}] = \binom{k}{r} n^{k-r} m^{r}$$

For the case where m = 1, the above expression simplifies to

$$Count[Deg = (n+1)^r\, 2^{k-r}] = \binom{k}{r} n^{k-r}$$

These offsets are illustrated in Figure 10.9. In each case, the effect of the higher
order terms is to cause the slope of the degree distribution to become steeper.
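The degree distribution of $(B + I)^{\otimes k}$ can be checked by brute force for small cases, using the fact that row sums multiply under the Kronecker product. In the sketch below, the predicted count $\binom{k}{r} n^{k-r} m^{r}$ for degree $(n+1)^r (m+1)^{k-r}$ is an assumption derived from that fact (r factors on the m side, k - r factors on the n side), and it presumes n > m so that distinct r give distinct degrees. Function names are illustrative.

```python
from collections import Counter
from itertools import product
from math import comb, prod


def degree_counts(n, m, k):
    """Count vertices of (B(n, m) + I)^(x)k by degree, by enumeration."""
    base = [m + 1] * n + [n + 1] * m        # row sums of B(n, m) + I
    return Counter(prod(c) for c in product(base, repeat=k))


def predicted_counts(n, m, k):
    """Assumed closed form: degree (n+1)^r (m+1)^(k-r) occurs C(k,r) n^(k-r) m^r times."""
    return {(n + 1) ** r * (m + 1) ** (k - r): comb(k, r) * n ** (k - r) * m ** r
            for r in range(k + 1)}


measured = dict(degree_counts(4, 1, 5))     # n = 4, m = 1, k = 5
predicted = predicted_counts(4, 1, 5)
```

The enumeration visits all $(n+m)^k$ vertices, so it is only meant for small k.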

10.6.5 Graph diameter and eigenvalues 

The diameter of $(B+I)^{\otimes k}$ can be readily deduced from Theorem 5 of [Leskovec 2005],
which states that if G has a given diameter and contains I, then $G^{\otimes k}$ has the same
diameter as G. Thus

$$Diam((B + I)^{\otimes k}) = 2$$

Figure 10.9. Degree distribution of higher orders.

The degree distribution of the higher orders is a simple horizontal offset
of the degree distribution of $B^{\otimes k}$. Values shown are for n = 4, m = 1,
and k = 5.



We can understand this by noting that even in the sparsest case (B(n, 1)) the edges 
per vertex are 

Subtracting the set of these that are intrablock edges leaves (for reasonable values
of n and k)

(Interblock edges)/iV„ w 3'^ - n''''^ w 3^= 

which is enough interblock edges to keep up with the $2^k$ growth rate in the number
of blocks.

The more interesting situation occurs when an instance of the graph is created.
In this situation, it is typical to hold the ratio $N_e/N_v$ constant. Since, at best, each
vertex is connected to $N_e/N_v$ of the $2^k$ blocks, this implies

or

$$Diam(Instance((B + I)^{\otimes k})) \approx k / \lg(N_e/N_v) = O(\lg(N_e))$$

The eigenvalues of B + I are given by

$$eig(B + I) = eig(B) + 1 = \{(nm)^{1/2} + 1,\ \underbrace{1, \ldots, 1}_{N-2},\ 1 - (nm)^{1/2}\}$$

The largest eigenvalues of $(B + I)^{\otimes k}$ are thus

$$eig((B+I)^{\otimes k}) = \{((nm)^{1/2}+1)^k,\ ((nm)^{1/2}+1)^{k-1},\ ((nm)^{1/2}-1)^2 ((nm)^{1/2}+1)^{k-2},\ \ldots\}$$

Interestingly, for these kinds of Kronecker graphs, there is not an obvious relation
between the eigenvalues and the diameter.

10.6.6 Iso-parametric ratio

For $(B + I)^{\otimes k}$ the iso-parametric ratio of only those vertices in $n_k(i)$ or $m_k(i)$ is

$$IsoPar(n_k(i)) = 2\, \frac{\sum A(n_k(i), :)}{n_k(i)} - 2$$

The denominator is from the $I^{\otimes k}$ term, which implies that every vertex always has
a self-loop. Letting r = r(i) and using the formula for the degree distribution of
$(B + I)^{\otimes k}$ result in

$$IsoPar(n_k(i)) = IsoPar(n^r m^{k-r})$$
$$= 2\, \frac{n^r m^{k-r} (n+1)^{k-r} (m+1)^r}{n^r m^{k-r}} - 2$$
$$= 2 (n+1)^{k-r} (m+1)^r - 2$$

For m = 1, the above expression simplifies to

$$IsoPar(n^r) = 2^{r+1} (n+1)^{k-r} - 2$$

which has a maximum value at r = 0 and decreases exponentially to a minimum
value at r = k

$$IsoPar(n^0) = 2 (n+1)^k - 2 \approx 2 (n+1)^k$$
$$IsoPar(n^k) = 2^{k+1} - 2 \approx 2^{k+1}$$

The next case, when the subset consists of all the vertices in subgraph $B(n_k(i), m_k(i))$,
is quite similar except there are additional terms in the denominator

$$IsoPar(n_k(i) \cup m_k(i)) = 2\, \frac{\sum A(n_k(i) \cup m_k(i), :)}{2 n_k(i) m_k(i) + n_k(i) + m_k(i) + [\chi\ \text{terms}]} - 2$$
$$= \frac{n_k(i)\, \overline{IsoPar}(n_k(i)) + m_k(i)\, \overline{IsoPar}(m_k(i))}{2 n_k(i) m_k(i) + n_k(i) + m_k(i) + [\chi\ \text{terms}]} - 2$$

where $\overline{IsoPar} = IsoPar + 2$. Substituting for $n_k(i)$ and $m_k(i)$ gives

$$IsoPar(n_k(i) \cup m_k(i)) = IsoPar(n^r m^{k-r} \cup n^{k-r} m^r)$$
$$= \frac{n^r m^{k-r}\, \overline{IsoPar}(n^r m^{k-r}) + n^{k-r} m^r\, \overline{IsoPar}(n^{k-r} m^r)}{2 n^k m^k + n^r m^{k-r} + n^{k-r} m^r + [\chi\ \text{terms}]} - 2$$
$$= 2\, \frac{n^r m^{k-r} (n+1)^{k-r} (m+1)^r + n^{k-r} m^r (n+1)^r (m+1)^{k-r}}{2 n^k m^k + n^r m^{k-r} + n^{k-r} m^r + [\chi\ \text{terms}]} - 2$$

For m = 1, the above expression simplifies to

$$2\, \frac{n^r 2^r (n+1)^{k-r} + n^{k-r} 2^{k-r} (n+1)^r}{2 n^k + n^r + n^{k-r} + [\chi\ \text{terms}]} - 2$$

Ignoring the $\chi$ terms, the above expression has a maximum at r = 0 and r = k

$$IsoPar(n^0 \cup n^k) = IsoPar(n^k \cup n^0) = 2\, \frac{(n+1)^k + (2n)^k}{3 n^k + 1} - 2$$



2fe 



and has a minimum at r = k/2



^ ^ + 1 



It is interesting to note that iso-parametric ratios of the set $n_k(i) \cup m_k(i)$ are
always much smaller than the iso-parametric ratio of either $n_k(i)$ or $m_k(i)$. In other
words

$$IsoPar(n_k(i)) \gg IsoPar(n_k(i) \cup m_k(i))$$
$$IsoPar(m_k(i)) \gg IsoPar(n_k(i) \cup m_k(i))$$

This is illustrated in Figure 10.10.



10.7 Implications of bipartite substructure 

The previous sections have focused on analyzing the properties of the explicit graphs 
generated by Kronecker products. This section looks at some of the implications of 
these results to other types of graphs. 



10.7.1 Relation between explicit and instance graphs 

The properties of a particular graph instance randomly drawn from an explicit graph
can be computed directly from the properties of the explicit graph. Let a particular
instance graph have vertices $N_v$ and edges $N_e$. Furthermore, let $f_e$ denote the ratio
of the number of edges in the instance graph to the number of edges in the explicit
graph. For a $G^{\otimes k}$, this is

Figure 10.10. Iso-parametric ratios.

The iso-parametric ratios as a function of input set for $(B + I)^{\otimes k}$. Values
shown are for n = 4, m = 1, and k = 5.

The degree distribution of an instance graph will then be the sum of the Poisson
distributions of the degree distribution of the explicit graph
Count[Deg{Instance{G'^'')) = j] = (■^') 



where 

and $\lambda_i$ is the expected number of edges per vertex for each block of vertices i. For
G = B(n, m), this is

$$\lambda_i = f_e\, n^{i} m^{k-i}$$

where

$$f_e = (N_e/N_v) / (2nm/N)^k$$

which for m = 1 is $\approx (N_e/N_v)/2^k$. Likewise, for G = B + I

$$\lambda_i = f_e\, (n+1)^{i} (m+1)^{k-i}$$

where

$$f_e = (N_e/N_v) / ((2nm + N)/N)^k$$

Figure 10.11 shows the degree distribution of 1,000,000 edge and 125,000 edge 
instance graphs taken from B(4, 1)**^. 
Figure 10.11. Instance degree distribution.

The degree distribution of 1,000,000 edge and 125,000 edge instance 
graphs taken from B(4, 1)®^. Circles show the measured degree distri- 
bution of the instance graphs. Solid lines show the predicted degree 
distribution obtained by summing the Poisson distributions. Dashed 
lines show the Poisson distributions of each of the underlying terms in 
the explicit graph. The outlier points at high vertex degrees are because 
the vertex count can be less than one, and although the probability of 
one of these vertices occurring is low, the aggregate probability is enough 
to generate a small number of these high vertex degree points. 
The peak-to-peak slope of the above distribution can be computed by assuming
that $\lambda_i$ is sufficiently large that the Poisson distributions can be approximated by
a Gaussian distribution

$$P_{\lambda_i}(j) \approx Gauss_{\lambda_i, \lambda_i^{1/2}}(j) = (2\pi\lambda_i)^{-1/2} \exp[-(j - \lambda_i)^2 / 2\lambda_i]$$

The peaks will occur at $j \approx \lambda_i$

$$P_{\lambda_i}(\lambda_i) \approx (2\pi\lambda_i)^{-1/2}$$

which results in the following formulas for the slopes

$$Slope[\lambda_i \to \lambda_{i+1}] \approx -\frac{3}{2} + \frac{\log_n((k-i)/(i+1))}{\log_n(n/m)}$$

and

$$Slope[\lambda_i \to \lambda_{k-i}] \approx -\frac{3}{2}$$

Thus, the peak-to-peak slope over any symmetric interval of an instance graph is
also a power law, but with a slightly steeper slope.
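The two slope formulas can be reproduced numerically from the Gaussian peak heights. The sketch below assumes, for $G = B(n, m)$, that the block with explicit degree $n^i m^{k-i}$ contains $\binom{k}{i} n^{k-i} m^{i}$ vertices, so that its peak height is that count times $(2\pi\lambda_i)^{-1/2}$; this count and the function names are assumptions made for the illustration.

```python
from math import comb, log, pi, sqrt


def peak(n, m, k, i, f_e):
    """Return (lambda_i, peak height) of the i-th Poisson term for B(n, m)^(x)k."""
    lam = f_e * n ** i * m ** (k - i)
    count = comb(k, i) * n ** (k - i) * m ** i   # assumed block population
    return lam, count / sqrt(2 * pi * lam)


def slope(n, m, k, i, j, f_e=1.0):
    """Log-log slope between the peaks of terms i and j."""
    lam_i, h_i = peak(n, m, k, i, f_e)
    lam_j, h_j = peak(n, m, k, j, f_e)
    return (log(h_j) - log(h_i)) / (log(lam_j) - log(lam_i))


symmetric = slope(4, 1, 8, 2, 6)   # interval i -> k - i
adjacent = slope(4, 1, 8, 2, 3)    # interval i -> i + 1
# Closed form for the adjacent case with n = 4, m = 1, k = 8, i = 2.
closed_form = -1.5 + log((8 - 2) / (2 + 1)) / log(4 / 1)
```

The factor `f_e` cancels in both slopes, which is why its value does not affect the result.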



10.7.2 Clustering power law graphs 

A standard technique used for analyzing graphs is to attempt to cluster the graph. 
The basic clustering heuristic states the following [Radicchi 2004] 

Qualitatively, a community is defined as a subset of nodes within the 
graph such that connections between the nodes are denser than connec- 
tions with the rest of the network. 

Consider a graph that ideally suits the clustering heuristic: vertices inside clusters
have many more connections within the cluster than outside the cluster. This is
ideally represented by a series of $N_1$ loosely connected cliques of size $N_2$

$$I(N_1) \otimes \mathbf{1}(N_2) + \gamma \mathbf{1}(N_1 N_2) = \bigcup^{N_1} \mathbf{1}(N_2) + \gamma \mathbf{1}(N_1 N_2)$$

If $\gamma = 0$, the degree distribution of such a graph is peaked around

$$Count[Deg(G) = N_2] = N_1 N_2$$

This will roughly correspond to a single Poisson peak (see Figure 10.11). If $\gamma > 0$,
then the degree distribution will be broadened slightly around this peak. The
distribution can be broadened further by varying $N_2$. Making this type of distribution
consistent with a power law distribution is a challenge because it fundamentally
consists of only one Poisson distribution.
10.7.3 Dendrogram and power law graphs

Another technique for analyzing graphs is to attempt to organize the graph into a
dendrogram or tree structure. The basic dendrogram heuristic states the following
[Radicchi 2004]

The detection of the community structure in a network is generally
intended as a procedure for mapping the network into a tree. In this
tree (called a dendrogram in the social sciences), the leaves are the nodes
whereas the branches join nodes or (at higher level) groups of nodes, thus
identifying a hierarchical structure of communities nested within each
other.

Likewise, the ideal graph to apply a dendrogram heuristic is a tree. A tree of degree
k with l levels will also have a degree distribution that is peaked at

$$Count[Deg(G) = k + 1] = k^l$$

This will roughly correspond to a single Poisson peak (see Figure 10.11). The
distribution can be broadened further by varying k with l, but making this type
of distribution consistent with a power law distribution is a challenge because it
fundamentally consists of only one Poisson distribution.

10.8 Conclusions and future work

An analytical theory of power law graphs based on the Kronecker graph genera- 
tion technique is presented. The analysis uses Kronecker exponentials of complete 
bipartite graphs to formulate the substructure of such graphs. This approach al- 
lows various high-level quantities (e.g., degree distribution, betweenness centrality, 
diameter, and eigenvalues) to be computed directly from the model parameters. 

The analysis presented here on power law Kronecker matrices shows that the 
graph substructure is very different from those found in clusters or trees. Further- 
more, the substructure changes qualitatively as a function of the degree distribution. 

There are a number of avenues that can be pursued in subsequent work. These 
include 

• Apply recursive bipartite permutation to various instance matrices. 

• Examine if substructures produced in these generators are similar to those 
found in real data. 

• Characterize vertices by their local bipartite adjacency matrix. 

• Use models to predict exact computational complexity of algorithms on these 
graphs. 
• Explore other "Kronecker" operations. For example

$$A = B \otimes_{+} C$$

A = B 0™"^ C 
A = B 0^°'' C 

• Block diagonal matrix as a generalization of I. For example

$$I(N) \otimes \mathbf{1} = \bigcup^{N} \mathbf{1}$$
Visualizing Large
Kronecker Graphs



Huy Nguyen*, Jeremy Kepner^ , and Alan Edelman^ 



Abstract 

Kronecker graphs have been shown to be one of the most promising 
models for real-world networks. Visualization of Kronecker graphs is an 
important challenge. This chapter describes an interactive framework to 
assist scientists and engineers in generating, analyzing, and visualizing 
Kronecker graphs with as little effort as possible. 

11.1 Introduction 

Kronecker graphs are of interest to the graph mining community because they
possess many important patterns of realistic networks and are useful for theoretical
analysis and proof (see [Leskovec et al. 2010] and Chapters 9 and 10). Once
the model is fitted to the real networks, many applications can be built on top
of Kronecker graphs, including graph compression, extrapolation, sampling, and
anonymization. Moreover, there are efficient algorithms that can find Kronecker
graphs that match important patterns of real networks [Leskovec et al. 2010].
Nevertheless, our understanding of Kronecker graphs is still limited. Chapter 9 (see
also [Kepner 2008]) has shown that a simple combination of bipartite plus identity
graphs can generate a rich class of graphs that have many important patterns of a
realistic network.

Figure 11.1. The seed matrix G.

We have developed an interactive toolkit that assists scientists and algorithm 
designers in generating, analyzing, and visualizing Kronecker graphs. On a com- 
modity workstation, the framework can interactively generate and analyze Kro- 
necker graphs with millions of vertices. In order to visualize Kronecker graphs, 
we implement a visualizing algorithm on a parallel system that can then efficiently 
drive a large 250 megapixel display wall [Hill 2009]. This system allows for effective 
visualization of graphs of up to 100,000 vertices. Moreover, the framework is de- 
signed in an intuitive and interactive fashion that is easy to use and does not require 
much experience from the users in working with Kronecker graphs. We believe this 
tool can be a potential first step for anyone who wants to use Kronecker graphs as 
a model for their own networks. 



11.2 Kronecker graph model

The working model of Kronecker graphs in our framework is a generalization of
the bipartite stochastic model (see Chapter 9). In particular, let G be the seed
matrix that is used to generate Kronecker graphs. G is a linear combination of a
bipartite matrix B(n, m) and a diagonal matrix I (see Figure 11.1). B(n, m) is a
four-quadrant matrix with size (m + n) x (m + n). The values of the entries in each
of the quadrants are a, b, c, and d. In the diagonal matrix, all entries in the main
diagonal have value i.

This simple model covers a range of Kronecker graphs. The stochastic model
presented in [Leskovec et al. 2010] is a special case with m = n = 1 and i = 0. The
model in [Kepner 2008] (where G is the union of a bipartite graph and an identity
graph) is also a special case of our model with a = d = 0 and b = c = i = 1.
11.3 Kronecker graph generator

Let N and M be the desired number of vertices and edges of the Kronecker graph.
If $N = (m+n)^k$, it is simple to generate a graph with the desired number of vertices.
Specifically, we consider the N x N matrix $A = G^{\otimes k}$ where G is the seed matrix
given above. For any pair of vertices i and j, a directed edge from i to j is created
with probability A(i, j). In case the desired graph is sparse, M = O(N), computing
the whole matrix A is redundant. Instead, the algorithm can just generate the edge
list directly from the seed matrix, with total running time O(M log N).
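One way to generate an edge list without forming A is to pick, for each edge, one seed-matrix cell per Kronecker level with probability proportional to its value, which costs O(k) = O(log N) per edge. The sketch below illustrates this idea; it is not necessarily the generator used in the framework. It draws M edges with replacement (so duplicates can occur) rather than testing every vertex pair independently, and the layout of the seed matrix and all function names are assumptions.

```python
import random


def seed_matrix(n, m, a, b, c, d, i):
    """Four-quadrant seed with values a, b, c, d plus i on the main diagonal."""
    size = n + m
    g = [[0.0] * size for _ in range(size)]
    for r in range(size):
        for s in range(size):
            if r < n and s < n:
                q = a
            elif r < n:
                q = b
            elif s < n:
                q = c
            else:
                q = d
            g[r][s] = q + (i if r == s else 0.0)
    return g


def sample_edges(g, k, num_edges, rng):
    """Draw num_edges (row, col) pairs of G^(x)k with probability proportional to A(row, col)."""
    size = len(g)
    cells = [(r, s) for r in range(size) for s in range(size)]
    weights = [g[r][s] for r, s in cells]
    edges = []
    for _ in range(num_edges):
        row = col = 0
        # One weighted cell choice per Kronecker level.
        for r, s in rng.choices(cells, weights=weights, k=k):
            row, col = row * size + r, col * size + s
        edges.append((row, col))
    return edges


def edge_probability(g, k, row, col):
    """Entry A(row, col) of G^(x)k, computed from the base-(n+m) digits."""
    size, p = len(g), 1.0
    for _ in range(k):
        p *= g[row % size][col % size]
        row //= size
        col //= size
    return p


g = seed_matrix(2, 1, 0.0, 1.0, 1.0, 0.0, 0.0)   # pure bipartite seed B(2, 1)
edges = sample_edges(g, 3, 50, random.Random(0))
```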

However, in case N is not a power of m + n, it is unclear how to generate
Kronecker graphs from the algorithm above. A simple interpolation algorithm
that creates a larger Kronecker graph (with $(m+n)^k$ vertices such that
$(m+n)^{k-1} < N < (m+n)^k$) and then selects an appropriate subgraph does not
work. Experiments show that simple interpolation produces large jumps (400%)
on the edge/vertex (M/N) ratio in the resulting graph (see Figure 11.2). Sudden
jumps in the edge/vertex ratio are not consistent with how real-world graphs grow.

The jumps in the edge/vertex ratio can be reduced by selecting the subgraph
more intelligently using our "organic growth" interpolation algorithm. For a given
desired N, our algorithm picks $(m+n)^k$, the smallest power of m + n that is larger
than N. Then, we generate an edge list (directly from the seed matrix) for the
graph of size $(m+n)^k \times (m+n)^k$. Now, instead of taking a subgraph of size N, we
randomly shuffle the labels of the vertices and then pick the N vertices with highest
degree. As shown in Figure 11.2, generating Kronecker graphs in this way reduces
the interpolation error of the edge/vertex ratio significantly.
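The selection step of the organic growth interpolation can be sketched as follows. The details here are assumptions: degree is taken as in-degree plus out-degree over the edge list, the random relabeling is used only to break ties between equal-degree vertices, and `organic_subgraph` is a hypothetical name.

```python
import random
from collections import Counter


def organic_subgraph(edges, num_full, num_keep, rng):
    """Relabel vertices at random, then keep the num_keep highest-degree vertices."""
    labels = list(range(num_full))
    rng.shuffle(labels)                         # labels[v] is the new label of v
    relabeled = [(labels[u], labels[v]) for u, v in edges]
    degree = Counter()
    for u, v in relabeled:
        degree[u] += 1
        degree[v] += 1
    # Sort by decreasing degree; ties fall back to the shuffled label.
    ranked = sorted(range(num_full), key=lambda v: (-degree[v], v))
    keep = set(ranked[:num_keep])
    return [(u, v) for u, v in relabeled if u in keep and v in keep]


# Star on 6 vertices plus one extra edge: vertices 0, 1, 2 have the top degrees.
full_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
kept = organic_subgraph(full_edges, 6, 3, random.Random(1))
```

In this small example the three retained vertices form a triangle, so all three edges among them survive regardless of the random relabeling.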

11.4 Analyzing Kronecker graphs

Analyzing Kronecker graphs is the main feature of our framework. Once a graph 
is generated, it can be analyzed by three different methods: graph metrics, graph 
view, and graph organic growth. The graph metrics are a set of statistics about 
the structure of the graph that can be used to compare the similarity between the 
generated Kronecker graph and the target real network. The graph view helps 
users observe the generated graph from different perspectives and thereby identify 
important properties of the graph. Finally, the graph organic growth simulates the 
growing (or shrinking) process of a network graph using the previously described 
interpolation algorithm. 

11.4.1 Graph metrics 

We compute a set of statistics that can be used to derive many of the important 
metrics of a given graph. 

• Degree distribution power law exponent: The degree distribution of a graph is
a power law if the number of vertices with degree d in the graph is proportional
to $d^{-\lambda}$ where $\lambda > 0$ is the power law exponent. Both the real networks and
the Kronecker graphs can exhibit power law degree distribution.

• Densification power law exponent: The number of edges M may be proportional
to $N^{\alpha}$ where $\alpha$ is the densification power law exponent. Similar to the
degree distribution power law exponent, this exponent can also be used as a
metric of the graph [Leskovec et al. 2005].

• Effective graph diameter: The effective graph diameter was proposed in
[Leskovec et al. 2005] as a robust measurement of the maximum distance
between vertices in the graph. It is observed that many real-world networks
have relatively small effective graph diameter (see [Leskovec et al. 2005,
Albert 2001]). The diameter of the graph is not used here since it is susceptible
to outliers.

• Hop plot: The function $f : k \to f(k)$, which is the fraction of pairs of vertices
that have distance k in the graph [Palmer et al. 2002].

• Node triangle participation: The function $g : k \to g(k)$, which is the number
of vertices that participate in k triangles in the graph [Tsourakakis 2008].

Figure 11.2. Interpolation algorithm comparison.

Desired number of vertices versus the resulting number of edges for the
simple interpolation algorithm (top) and our organic growth interpolation
algorithm (bottom). The simple interpolation approach can produce
a 400% jump in the number of edges when the desired number of
vertices N is very close to $(n + m)^k$. The organic growth interpolation
algorithm randomizes the vertex labels and selects the highest degree
nodes to produce a much smoother curve with smaller jumps at these
boundaries.

11.4.2 Graph view

Graph view is a visual way to study the structure of a Kronecker graph. The view 
is a 2D image of its 0/1 adjacency matrix under some permutation of the vertices. 
By adding more than one permutation to the framework, we hope that the images 
they create can give the users different perspectives about the graph structure and 
can help identify useful patterns. For example, if we view a Kronecker graph in the 
order it was generated, it is difficult to realize that its degree distribution obeys the 
power law. However, if we view the graph in degree sorted order, the power law 
property of its degree distribution can be easily noticed. Our framework allows a 
user to visualize the generated graph in four different permutations: 

• Degree sorted: This view corresponds to vertices in nonincreasing order of 
degree. 

• Randomized: This is the view taken from a random permutation. 

• As generated: This is the view where the original order of vertices is used. 

• Bipartite: This view exploits the inherent structure of a bipartite generating 
graph. The permutation takes advantage of the near-bipartiteness of the 
seed graph to efficiently detect the highly connected components in the graph 
and separate them from others. For more details on this permutation, see 
Chapter 10. 

11.4.3 Organic growth simulation

In addition to the graph metrics and graph view, which only work with static
Kronecker graphs, it is possible to use the organic growth interpolation algorithm
to simulate how a graph might grow (or shrink). The simulation can show how a
graph has grown from a single vertex to the current state and beyond.
The main application of this feature is network extrapolation [Leskovec et al. 2010].
Given a real-world network and a Kronecker graph G that models that network,
then by applying organic growth simulation on G, we can look into the future to see
how the network might evolve. Similarly, organic growth simulation can also help
us look into the past to see how the network might have looked in its early stages.

Figure 11.3. Graph permutations.

Views of different permutations of a Kronecker graph generated with 
a bipartite generating graph. Top left: As generated. Bottom left: 
Randomized. Top right: Degree sorted. Bottom right: Bipartite. The 
bipartite permutation clearly shows that the resulting graph consists of 
eight disconnected bipartite graphs. 



11.5 Visualizing Kronecker graphs in 3D

Good visualization is an important part of an analytical framework. In this section,
we describe our tool to visualize a Kronecker graph by projecting it onto the
surface of a sphere. The idea of embedding a Kronecker graph onto a sphere was
proposed and proved effective by Gilbert, Reinhardt, and Shah [Gilbert et al. 2007].
However, their embedding algorithm, which was designed for general graphs, did
not take advantage of Kronecker graphs' structure. In our algorithm, we use the
bipartite clustering method (as in the bipartite view) to partition the graph into
well-connected components (see Figures 11.3 and 11.4) and embed them onto the
sphere. As a result, the visualization quality has been improved significantly
compared to the Fiedler mapping method (see Figure 11.5).



Downloaded 09 Dec 2011 to 129.174.55.245. Redistribution subject to SIAM license or copyright; see http://www.siam.org/journals/ojsa.php







Figure 11.4. Concentric bipartite mapping. 

Left: A near-bipartite subgraph. Right: Mapping of subgraph on concentric
circles.

11.5.1 Embedding Kronecker graphs onto a sphere surface

Given a Kronecker graph, our visualization first partitions the graph into dense
near-bipartite subgraphs using the bipartite permutation (see Chapter 10). Each
subgraph is organized into a pair of concentric circles (see Figure 11.4), and
these subgraphs are then placed onto a sphere so that they do not overlap. More
specifically, each such subgraph B(n, m) comprises two nearly disjoint vertex
sets of sizes n and m, where n > m. The subgraph is mapped onto two concentric
circles such that the n points are on the outer circle and the m points are on
the inner circle (see Figure 11.5). All the concentric circles are embedded on a
sphere surface by using the Golden Section spiral method [Rusin 1998], which
guarantees that the circles do not intersect and are evenly distributed. Because
the majority of edges in the graph are internal to the subgraphs, the
visualization is pleasing to the eye and the overall structure of the Kronecker
graph can be seen clearly.
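
As a rough illustration of the Golden Section spiral idea, the sketch below generates approximately evenly spaced points on the unit sphere, which could serve as centers for the concentric-circle pairs. The function name `golden_spiral_points` is hypothetical, and the text does not specify how the circles are sized or oriented around each center, so only the center placement is shown.

```python
import math

def golden_spiral_points(count):
    """Return `count` roughly evenly spaced points on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 radians
    points = []
    for i in range(count):
        z = 1.0 - (2.0 * i + 1.0) / count      # heights evenly spaced in (-1, 1)
        radius = math.sqrt(1.0 - z * z)        # radius of the horizontal slice at z
        theta = i * golden_angle               # rotate by the golden angle each step
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points

# One candidate center per near-bipartite subgraph (eight subgraphs assumed here).
centers = golden_spiral_points(8)
```

Because consecutive points differ by an irrational fraction of a full turn, they do not line up along meridians, which is what gives the spiral its even coverage.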

11.5.2 Visualizing Kronecker graphs on a parallel system

Because the framework is designed to work with very large Kronecker graphs (up
to 100,000 vertices), it is not practical to visualize them on a commodity
workstation. Therefore, we implement our three-dimensional (3D) visualization
algorithm on a parallel system with 60 display panels (2560 x 1600 pixels each)
and 30 computational nodes, each of which is responsible for 2 display panels
(see Figure 11.6).








Figure 11.5. Kronecker graph visualizations. 

Top: Fiedler mapping of a Kronecker graph onto a sphere. Bottom: 
Mapping of Kronecker graph with bipartite clustering method. 



Figure 11.7 shows how the 3D visualization is designed on the parallel system.
First, the graph is distributed to all computational nodes. Then, each node
removes all vertices and edges of the graph that are not visible on that node.
Finally, the visible edges and vertices are mapped onto a sphere surface using
the algorithm above and rendered on the display panels.

Figure 11.7. Parallel system to visualize a Kronecker graph in 3D.

Components consist of input graph, hidden surface removal (HSR), 
graphics processing units (GPU), and video display panels. 






References 

[Albert 2001] R.Z. Albert and A.-L. Barabási. Statistical mechanics of complex
networks. Reviews of Modern Physics, 74:47-97, 2002.

[Gilbert et al. 2007] J.R. Gilbert, S. Reinhardt, and V. Shah. An interactive
environment to manipulate large graphs. In Proceedings of the 2007 IEEE
International Conference on Acoustics, Speech, and Signal Processing,
4:IV-1201-IV-1204, 2007.

[Hill 2009] C. Hill. The Darwin Project. http://darwinproject.mit.edu/

[Kepner 2008] J. Kepner. Analytic theory of power law graphs. In SIAM Parallel
Processing 2008, Minisymposium on HPC on Large Graphs, Atlanta, GA, 2008.

[Leskovec et al. 2010] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos,
and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. Journal
of Machine Learning Research, 11:985-1042, 2010.

[Leskovec et al. 2005] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over
time: Densification laws, shrinking diameters and possible explanations. In
Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge
Discovery in Data Mining (KDD '05), 177-187, 2005.

[Palmer et al. 2002] C.R. Palmer, P.B. Gibbons, and C. Faloutsos. ANF: A fast
and scalable tool for data mining in massive graphs. In Proceedings of the
International Conference on Knowledge Discovery in Data Mining (KDD '02),
81-90, 2002.

[Rusin 1998] D. Rusin. Topics on sphere distributions. 1998.
http://www.math.niu.edu/~rusin/known-math/95/sphere.faq

[Tsourakakis 2008] C.E. Tsourakakis. Fast counting of triangles in large real
networks without counting: Algorithms and laws. In Proceedings of the IEEE
International Conference on Data Mining, 608-617, 2008.






Chapter 12 

Large-Scale Network 
Analysis 



David A. Bader, Christine E. Heitsch, and
Kamesh Madduri

Abstract 

Centrality analysis deals with the identification of critical vertices and
edges in real-world graph abstractions. Graph-theoretic centrality heuristics
such as betweenness and closeness are widely used in application domains
ranging from social network analysis to systems biology. In this chapter, we
discuss several new results related to large-scale graph analysis using
centrality indices. We present the first parallel algorithms and efficient
implementations for evaluating these compute-intensive metrics. Our parallel
algorithms are optimized for real-world networks, and they exploit topological
properties such as the low graph diameter and unbalanced degree distributions.
We evaluate centrality indices for several large-scale networks such as web
crawls, protein-interaction networks (PINs), movie-actor networks, and patent
citation networks that are three orders of magnitude larger than instances that
can be processed by current social network analysis packages. As an application
to systems biology, we present the novel case study of betweenness centrality
analysis applied to eukaryotic PINs. We make an important observation that
proteins with high betweenness centrality, but low degree, are abundant in the
human and yeast PINs.

12.1 Introduction 

Graph abstractions are used to model interactions in a variety of application
domains such as social networks (friendship circles, organizational networks),
the Internet (network topologies, the web-graph, peer-to-peer networks),
transportation networks, electrical circuits, genealogical research, and
computational biology (protein-interaction networks, food webs). These networks
seem to be entirely unrelated and indeed represent quite diverse relations, but
experimental studies [Barabási & R. Albert 2007, Broder et al. 2000,
Newman 2003] show that they share common traits such as a low average distance
between the vertices (the small-world property), heavy-tailed degree
distributions modeled by power laws, and high local densities. Modeling these
networks based on experiments and measurements, and the study of interesting
phenomena and observations [Callaway et al. 2000, Cohen et al. 2001,
Pastor-Satorras & Vespignani 2001, Zanette 2001], continue to be active areas
of research. Several models (see, e.g., [Guillaume & Latapy 2004,
Newman et al. 2002, Palmer & Steffan 2000]) have been proposed to generate
synthetic graph instances with these characteristics.

Complex network analysis traces its roots to the social sciences (see
[Scott 2000, Wasserman & Faust 1994]), and seminal contributions in this field
date back more than sixty years. There are several analytical tools (see, for
instance, [Huisman & van Duijn 2005]) for visualizing social networks,
determining empirical quantitative indices, and clustering. In most
applications, graph abstractions and algorithms are frequently used to help
capture the salient features. Thus, social network analysis (SNA) from a
graph-theoretic perspective is about extracting interesting information, given
a large graph constructed from a real-world data set.

Network modeling has received considerable attention in recent times, but
algorithms are relatively less studied. Real-world graphs are typically
characterized by low diameter, heavy-tailed degree distributions modeled by
power laws, and self-similarity. They can be very large and sparse, with the
number of vertices and edges ranging from several hundreds of thousands to
billions. On current workstations, it is not possible to do exact in-core
computations on these graphs because of the limited physical memory. In such
cases, parallel computing techniques can be applied to obtain exact solutions
for memory- and compute-intensive graph problems quickly. For instance, recent
experimental studies on breadth-first search for large-scale graphs show that a
parallel in-core implementation is two orders of magnitude faster than an
optimized external memory implementation [Ajwani et al. 2006,
Bader & Madduri 2006a]. The design of efficient parallel graph algorithms is
quite challenging [Lumsdaine et al. 2007] because massive graphs that occur in
real-world applications are often not amenable to a balanced partitioning among
processors of a parallel system [Lang 2005]. Algorithm design is simplified on
parallel shared memory systems; they offer a higher memory bandwidth and lower
latency than clusters, and the global shared memory obviates the need for
partitioning the graph. However, the locality characteristics of parallel graph
algorithms tend to be poorer than those of their sequential counterparts
[Cong & Sbaraglia 2006], and so achieving good performance is still a challenge.

The key contributions of our work are as follows.

• We present the first parallel algorithms for efficiently computing the
following centrality metrics: degree, closeness, stress, and betweenness (see,
for instance, [Bader & Madduri 2006b]). We optimize the algorithms to exploit
typical topological features of real-world graphs, and demonstrate the
capability to process data sets that are three orders of magnitude larger than
the networks that can be processed by existing social network analysis packages.

• We present the novel case study of betweenness centrality analysis applied to
eukaryotic protein interaction networks (PINs). Jeong et al. [Jeong et al. 2001]
empirically show that betweenness is positively correlated with a protein's
essentiality and evolutionary age. We observe that proteins with high
betweenness centrality but low degree are abundant in the human and yeast
PINs, and that current small-world network models fail to model this feature
[Bader & Madduri 2008].

• As a global shortest paths-based analysis metric, betweenness is highly
correlated with routing and data congestion in information networks; see
[Holme 2003, Singh & Gupte 2005]. We investigate the centrality of the integer
torus, a popular interconnection network topology for supercomputers. We state
and prove an empirical conjecture for betweenness centrality of all the vertices
in this regular topology. This result is used as a validation technique in the
HPCS Graph Analysis benchmark [Bader et al. 2006].

This chapter is organized as follows. Section 12.2 gives an overview of various 
centrality metrics and the sequential algorithms to compute them. We present our 
new parallel algorithms to compute centrality indices and optimizations for real- 
world networks in Section 12.3. Section 12.4 discusses implementation details and 
the performance of these algorithms on parallel shared memory and multithreaded 
architectures. We discuss the case study of betweenness applied to PINs in Sec- 
tion 12.5 and our result on betweenness for an integer torus in Section 12.6. 

12.2 Centrality metrics



One of the fundamental problems in network analysis is to determine the
importance or criticality of a particular vertex or an edge in a network.
Quantifying centrality and connectivity helps us identify or isolate regions of
the network that may play interesting roles. Researchers have been proposing
metrics for centrality for the past fifty years, and there is no single
accepted definition. The metric of choice is dependent on the application and
the network topology. Almost all metrics are empirical and can be applied to
element-level [Brin & Page 1998], group-level [Doreian & L.H. Albert 1989], or
network-level [Bacon] analyses. We discuss several commonly used vertex
centrality indices in this section. Edge centrality indices are similarly
defined.



Preliminaries 

Consider a graph $G = (V, E)$, where $V$ is the set of vertices representing
actors or entities in the complex network, and $E$ is the set of edges
representing relationships between the vertices. The number of vertices and
edges is denoted by $N$ and $M$, respectively. The graphs can be directed or
undirected. We will assume that each edge $e \in E$ has a positive integer
weight $w(e)$. For unweighted graphs, we use $w(e) = 1$. A path from vertex $s$
to $t$ is defined as a sequence of edges $(u_i, u_{i+1})$, $0 \le i < l$, where
$u_0 = s$ and $u_l = t$. The length of a path is the sum of the weights of its
edges. We use $d(s, t)$ to denote the distance between vertices $s$ and $t$ (the
minimum length of any path connecting $s$ and $t$ in $G$). We denote the total
number of shortest paths between vertices $s$ and $t$ by $\sigma_{st}$, and the
number passing through vertex $v$ by $\sigma_{st}(v)$.



Degree centrality 

In an undirected network, the degree centrality of a vertex is simply the count of 
the number of adjacencies or neighbors it has. For directed graphs, we can define 
two variants: in-degree centrality and out-degree centrality. This is a simple local 
measure based on the notion of neighborhood and is straightforward to compute. In 
many networks, a high degree vertex is considered an important or central player, 
and this index quantifies that observation. 



Closeness centrality 

This index measures the closeness, in terms of distance, of a vertex to all
other vertices in the network. Vertices with a smaller total distance are
considered more important. Several closeness-based metrics [Bavelas 1950,
Nieminen 1973] have been developed by the SNA community. A commonly used
definition is the reciprocal of the total shortest path distance from a
particular vertex to all other vertices:

$$CC(v) = \frac{1}{\sum_{u \in V} d(v, u)}$$

Unlike degree centrality, this is a global metric. To calculate the closeness
centrality of a vertex $v$, we may perform a breadth-first traversal (BFS, for
unweighted graphs) or use a single-source shortest paths (SSSP, for weighted
graphs) algorithm. The closeness centrality of a single vertex can be
determined in linear time for unweighted networks.


Stress centrality

Stress centrality is a metric based on shortest path counts, first presented in
[Shimbel 1953]. It is defined as

$$SC(v) = \sum_{s \ne v \ne t \in V} \sigma_{st}(v)$$

Intuitively, this metric deals with the communication work done by each vertex
in a network. The number of shortest paths that pass through a vertex $v$ gives
an estimate of the amount of stress the vertex is under, assuming communication
is carried out through shortest paths all the time. This index can be calculated
by using a variant of the all-pairs shortest paths algorithm and by computing
the number of shortest paths between every pair of vertices.

Betweenness centrality 

Betweenness centrality is another shortest paths enumeration-based metric,
introduced by Freeman in [Freeman 1977]. Let $\delta_{st}(v)$ denote the
pairwise dependency, or the fraction of shortest paths between $s$ and $t$ that
pass through $v$:

$$\delta_{st}(v) = \frac{\sigma_{st}(v)}{\sigma_{st}}$$

Betweenness centrality of a vertex $v$ is defined as

$$BC(v) = \sum_{s \ne v \ne t \in V} \delta_{st}(v) \qquad (12.1)$$

This metric can be thought of as a normalized version of stress centrality.
Betweenness centrality of a vertex measures the control a vertex has over
communication in the network and can be used to identify critical vertices in
the network. High centrality indices indicate that a vertex can reach other
vertices on relatively short paths or that a vertex lies on a considerable
fraction of shortest paths connecting pairs of other vertices.

This index has been extensively used in recent years for analysis of social as
well as other large-scale complex networks. Some applications include the
analysis of biological networks [del Sol et al. 2005, Jeong et al. 2001,
Pinney et al. 2005], study of sexual networks and AIDS [Liljeros et al. 2001],
identification of key actors in terrorist networks [Coffman et al. 2004,
Krebs 2002], organizational behavior, supply chain management
[Cisic et al. 2000], and transportation networks (see [Guimerà et al. 2005]).

There are a number of commercial and research software packages for SNA
(e.g., Pajek [Batagelj & A. Mrvar 1998], InFlow [Krebs 2005], UCINET [UCINET])
that can be used to determine these centrality metrics. However, they can only
process comparatively small networks (in most cases, sparse graphs with fewer
than 40,000 vertices). Our goal is to develop fast, high-performance
implementations of these metrics to process large-scale real-world graphs with
millions to billions of vertices.

Algorithms for computing betweenness centrality

We can evaluate the betweenness centrality of a vertex $v$ (defined in equation
(12.1)) by determining the number of shortest paths between every pair of
vertices $s$ and $t$ and the number of shortest paths that pass through $v$.
There is no known algorithm to compute the exact betweenness centrality score of
a single vertex without solving an all-pairs shortest paths problem instance in
the graph. In this chapter, we will constrain our discussion of parallel
betweenness centrality algorithms to directed, unweighted graphs. To process
undirected graphs, the network can be easily modified by replacing each edge by
two oppositely directed edges. While the approach to parallelization for
unweighted graphs [Bader & Madduri 2006b] works for weighted low-diameter graphs
as well, the concurrency in each parallel phase is dependent on the weight
distribution.

Earlier algorithms compute betweenness centrality in two steps: first, the
number and length of shortest paths between all pairs of vertices are computed
and stored, and second, the pairwise dependencies (the fractions
$\sigma_{st}(v)/\sigma_{st}$) for each $s$-$t$ pair are summed. The complexity
of this approach is $O(N^3)$ time and $O(N^2)$ space. Exploiting the sparse
nature of real-world networks, Brandes [Brandes 2001] presented an improved
sequential algorithm to compute the betweenness centrality score for all
vertices in an unweighted graph in $O(MN)$ time and $O(M + N)$ space. The main
idea is to perform $N$ breadth-first graph traversals and augment each
traversal to compute the number of shortest paths passing through each vertex.
The second key idea is that pairwise dependencies $\delta_{st}(v)$ can be
aggregated without computing all of them explicitly. Define the dependency of a
source vertex $s \in V$ on a vertex $v \in V$ as
$\delta_s(v) = \sum_{t \in V} \delta_{st}(v)$. The betweenness centrality of a
vertex $v$ can then be expressed as $BC(v) = \sum_{s \ne v \in V} \delta_s(v)$.
Brandes showed that the dependency values $\delta_s(v)$ satisfy the following
recursive relation:

$$\delta_s(v) = \sum_{w \,:\, d(s,w) = d(s,v) + 1} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right) \qquad (12.2)$$

Thus, the sequential algorithm computes betweenness in $O(MN)$ time by iterating
over all the vertices $s \in V$ and computing the dependency values
$\delta_s(v)$ in two stages. First, the distance and shortest path counts from
$s$ to each vertex are determined. Second, the vertices are revisited starting
with the farthest vertex from $s$ first, and dependencies are accumulated
according to equation (12.2).
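
The two stages described above can be sketched sequentially as follows. This is a minimal illustration for unweighted directed graphs given as adjacency lists; the function name `brandes_betweenness` and the list-of-lists input format are assumptions made for the example, not the chapter's implementation.

```python
from collections import deque

def brandes_betweenness(adj):
    """Betweenness of every vertex of an unweighted directed graph.

    `adj[v]` lists the out-neighbors of vertex v (vertices are 0..N-1).
    """
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        # Stage 1: BFS from s, recording distances, path counts, predecessors.
        dist = [-1] * n
        sigma = [0] * n
        preds = [[] for _ in range(n)]
        dist[s], sigma[s] = 0, 1
        order = []                      # vertices in nondecreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:         # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Stage 2: accumulate dependencies, farthest vertices first (eq. 12.2).
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Undirected path 0 - 1 - 2, stored as two directed edges per link.
scores = brandes_betweenness([[1], [0, 2], [1]])
```

On this path, only vertex 1 lies between another pair, and it is counted once per direction, so the undirected score is obtained by halving.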

12.3 Parallel centrality algorithms 

In this section, we present novel parallel algorithms to compute the various
centrality metrics, optimized for real-world networks. To the best of our
knowledge, these are the first parallel algorithms for centrality analysis. We
exploit the typical low-diameter (small-world) property to reveal an additional
level of parallelism in graph traversal and take the unbalanced degree
distribution into consideration while designing algorithms for the shortest
path-based enumeration metrics. In addition to the exact algorithms, we also
discuss approaches to approximate closeness and betweenness in
[Bader et al. 2007].

We use a compact array representation for the graph that requires $M + N + 1$
machine words. We assume that the vertices are labeled with integer identifiers
between $0$ and $N - 1$. All the adjacencies of a vertex are stored contiguously
in a block of memory, and the neighbors of vertex $i$ are stored next to those
of vertex $i + 1$. The size of the adjacency array is $M$, and we require an
array of pointers to this adjacency array, which is of size $N + 1$ words. This
representation is motivated by the fact that all the adjacencies of a vertex
are visited in the graph traversal step after the vertex is first discovered.
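
A minimal sketch of such a compact array layout is shown below. The helper names `build_csr` and `neighbors` are hypothetical; the point is only that an offsets array of N + 1 entries plus an adjacency array of M entries gives contiguous access to each vertex's neighbors.

```python
def build_csr(num_vertices, edges):
    """Pack a directed edge list into (offsets, adjacency) arrays.

    offsets has N + 1 entries and adjacency has M entries, so the
    representation uses M + N + 1 integers in total.
    """
    offsets = [0] * (num_vertices + 1)
    for u, _ in edges:
        offsets[u + 1] += 1                 # count the out-degree of u
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]        # prefix sums give the block starts
    adjacency = [0] * len(edges)
    cursor = offsets[:-1].copy()            # next free slot in each vertex's block
    for u, v in edges:
        adjacency[cursor[u]] = v
        cursor[u] += 1
    return offsets, adjacency

def neighbors(offsets, adjacency, v):
    """Neighbors of v occupy one contiguous slice of the adjacency array."""
    return adjacency[offsets[v]:offsets[v + 1]]

offsets, adjacency = build_csr(4, [(0, 1), (0, 2), (1, 2), (3, 0)])
```

With this layout the out-degree of vertex v is simply `offsets[v + 1] - offsets[v]`, so it can be read off in constant time.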

For parallel algorithm analysis, we use a complexity model similar to the one
proposed by Helman and JaJa [Helman & JaJa 2001], which has been shown to
work well in practice. This model takes into account both computational
complexity and memory contention. We measure the overall complexity of an
algorithm using $T_M(N, M, N_P)$, the maximum number of noncontiguous accesses
made by any processor to memory, and $T_C(N, M, N_P)$, the upper bound on the
local computational complexity. $N_P$ denotes the number of processors or
parallel threads of execution.

Degree centrality

We store the in- and out-degree of each vertex during construction of the graph 
abstraction. Thus, determining the degree centrality of a particular vertex is a 
constant-time lookup operation. As noted previously, degree centrality (also re- 
ferred to as vertex connectivity in statistical physics literature) is a useful local 
metric and probably the most studied measure in complex network analysis. 

Closeness centrality 

Closeness centrality of a vertex $v$ can be computed on a sequential processor
by a breadth-first traversal from $v$ (or single-source shortest paths in the
case of weighted graphs) and requires no auxiliary data structures. Thus, vertex
centrality computation can be done in parallel by a straightforward
parallelization of BFS. In a typical network analysis scenario, we would require
centrality scores of all the vertices in the graph. We can then evaluate $N$ BFS
(shortest path) trees in parallel, one for each vertex $v \in V$. On $N_P$
processors, this would yield $T_C = O\left(\frac{NM + N^2}{N_P}\right)$ and
$T_M = O\left(\frac{N^2}{N_P}\right)$ for unweighted graphs. For weighted
graphs, using a naive queue-based representation for the expanded frontier, we
can compute all the centrality metrics in
$T_C = O\left(\frac{NM + N^3}{N_P}\right)$ and
$T_M = O\left(\frac{N^2}{N_P}\right)$. The bounds can be further improved
with the use of efficient priority queue representations.
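
For the unweighted case, the per-vertex computation can be sketched as a single BFS. The function name `closeness` is hypothetical, and ignoring unreachable vertices (and returning 0 for an isolated vertex) is an assumption of this example, since the text does not say how disconnected graphs are handled.

```python
from collections import deque

def closeness(adj, v):
    """Reciprocal of the total BFS distance from v to all reachable vertices."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:           # first visit fixes the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1.0 / total if total > 0 else 0.0

# Star graph: vertex 0 is joined to vertices 1, 2, and 3.
star = [[1, 2, 3], [0], [0], [0]]
scores = [closeness(star, v) for v in range(len(star))]  # one independent BFS per vertex
```

Each call touches only its own distance table, which is why the N traversals can be handed to different processors without synchronization.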

Evaluating closeness centrality of all the vertices in a graph is
computationally intensive; hence, it is valuable to investigate approximate
algorithms. Using a random sampling technique, Eppstein and Wang
[Eppstein & Wang 2001] showed that the closeness centrality of all vertices in a
weighted, undirected graph can be approximated with high probability in
$O\left(\frac{\log N}{\epsilon^2}(N \log N + M)\right)$ time and an additive
error of at most $\epsilon \Delta_G$ ($\epsilon$ is a fixed constant, and
$\Delta_G$ is the diameter of the graph). The algorithm proceeds as follows. Let
$k$ be the number of iterations needed to obtain the desired error bound. In
iteration $i$, pick vertex $v_i$ uniformly at random from $V$ and solve the SSSP
problem with $v_i$ as the source. The estimated centrality is given by

$$CC_a(v) = \frac{k}{N \sum_{i=1}^{k} d(v_i, v)}$$

The error bounds follow from a result by Hoeffding [Hoeffding 1963] on
probability bounds for sums of independent random variables.

We design a parallel algorithm for approximate closeness centrality as follows.
Each processor runs SSSP computations from $k/N_P$ vertices and stores the
evaluated distance values. The cost of this step is given by
$T_C = O\left(\frac{k(M + N)}{N_P}\right)$ and
$T_M = O\left(\frac{kN}{N_P}\right)$ for unweighted graphs. For real-world
graphs, the number of sample vertices $k$ can be set to
$\Theta\left(\frac{\log N}{\epsilon^2}\right)$ to obtain the error bounds given
above. The approximate closeness centrality value of each vertex can then be
calculated in $O(k) = O\left(\frac{\log N}{\epsilon^2}\right)$ time, and the
summation for all $N$ vertices would require
$T_C = O\left(\frac{N \log N}{N_P \epsilon^2}\right)$ and constant $T_M$.
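
The sampling scheme can be sketched for unweighted, undirected graphs as below. The estimator used here scales the sampled distance sum by N/k before taking the reciprocal, which is an assumption consistent with the definition of closeness as a reciprocal of a distance sum; the function names and the optional `samples` argument are hypothetical conveniences for the example.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Unweighted single-source distances (-1 marks unreachable vertices)."""
    dist = [-1] * len(adj)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] < 0:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def approx_closeness(adj, k, rng=None, samples=None):
    """Estimate closeness of every vertex from k sampled source vertices."""
    n = len(adj)
    rng = rng or random.Random(0)
    if samples is None:
        samples = [rng.randrange(n) for _ in range(k)]   # uniform, with replacement
    totals = [0] * n
    for s in samples:                                    # each traversal is independent
        for v, d in enumerate(bfs_distances(adj, s)):
            totals[v] += max(d, 0)                       # unreachable pairs add nothing
    k = len(samples)
    return [k / (n * t) if t > 0 else 0.0 for t in totals]

path = [[1], [0, 2], [1]]
estimate = approx_closeness(path, k=3, samples=[0, 1, 2])
```

When every vertex is sampled exactly once, as in the usage line, the estimate coincides with the exact closeness values, which gives a simple sanity check for the scaling.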



Stress and betweenness centrality 

Computing stress and betweenness centrality involves shortest path enumeration
between every pair of vertices in the network, and there is no known algorithm
to compute the betweenness/stress centrality score of a single vertex or an edge
in linear time. We design two novel parallel algorithms for vertex betweenness
centrality that retain the computational complexity of Brandes' sequential
algorithm and that are particularly suited for real-world sparse networks.

We observe that parallelism can be exploited in the centrality computation at 
two levels: 

• The BFS/SSSP computations from each vertex can be done concurrently, 
provided the centrality running sums are updated atomically. 

• A single BFS/SSSP computation can be parallelized. Further, adjacencies of 
a vertex can be processed concurrently. 

We will refer to the parallelization approach that concurrently computes the
shortest path trees as the coarse-grained parallel betweenness centrality
algorithm, and the latter approach, in which a single BFS/SSSP traversal is
parallelized, as the fine-grained algorithm. Algorithm 12.1 gives the pseudocode
for the fine-grained approach and describes the two main stages that are
parallelized in each iteration. The loops that are executed in parallel (see
lines 3, 13, 14, and 26) are indicated in the schematic. The coarse-grained
algorithm uses the same data structures, but only the main loop (step 2) is
parallelized. Note that accesses to shared data structures and updates to the
distance and path counts need to be protected with appropriate synchronization
constructs, which we do not indicate in Algorithm 12.1.

Algorithm 12.1. Synchronous betweenness centrality.

A level-synchronous parallel algorithm for computing betweenness centrality of
vertices in unweighted graphs.

b : ℝ^N = ParallelBC(G = (V, E))

 1  b = 0
 2  for k ∈ V
 3    do for t ∈ V in parallel
 4         do P(t) : ℤ = empty multiset
 5       σ : ℤ^N = 0
 6       d : ℤ^N = -1
 7       σ(k) = 1, d(k) = 0
 8       i = 0, S(i) : ℤ = empty stack
 9       push(k, S(i))
10       c = 1
         Graph traversal for shortest path discovery & counting
11       while c > 0
12         do c = 0
13            for v ∈ S(i) in parallel
14              do for each neighbor w of v in parallel
15                   do if d(w) < 0
16                        do push(w, S(i + 1))
17                           c += 1
18                           d(w) = d(v) + 1
19                      if d(w) = d(v) + 1
20                        do σ(w) += σ(v)
21                           append(v, P(w))
22            i += 1
23       i -= 1
         Dependency accumulation by back-propagation
24       δ : ℝ^N = 0
25       while i > 0
26         do for w ∈ S(i) in parallel
27              do for v ∈ P(w)
28                   do δ(v) += (σ(v)/σ(w)) (1 + δ(w))
29                 b(w) += δ(w)
30            i -= 1


The fine-grained parallel algorithm proceeds as follows. Starting at the source
vertex $k$, we successively expand the frontier of visited vertices and augment
breadth-first graph traversal (we also refer to this as level-synchronous graph
traversal) to count the number of shortest paths passing through each vertex. We
maintain a multiset of predecessors $P(w)$ associated with each vertex $w$. A
vertex $v$ belongs to the predecessor multiset of $w$ if $(v, w) \in E$ and
$d(w) = d(v) + 1$. We implement each multiset as a dynamic array that can be
resized at runtime, and so $P$ is a two-dimensional data structure, with one
array for each of the $N$ vertices. The append routine used in line 21 of the
algorithm atomically adds a vertex to the multiset. Clearly, the size of a
predecessor multiset for a vertex is bounded by its in-degree (or degree, in the
case of an undirected graph). The predecessor information is used in the
dependency accumulation step, which implements equation (12.2). The other key
data structure used in both the stages is $S$, the stack of visited vertices.
$S(i)$ stores all the vertices that are at a distance $i$ from the source
vertex, and the vertices are added atomically to the stack using the push
routine. We need $O(N)$ storage for $S$. The rest of the data structures ($b$,
$\sigma$, $d$, and $\delta$) are one-dimensional arrays. In an arbitrary
network, there is a possibility that the path counts grow exponentially in the
graph traversal, which might lead to an arithmetic overflow for some values of
the path counts array $\sigma$. We use 64-bit unsigned integers in our
betweenness implementation and have so far never encountered a path count
overflow for a real-world network instance. The final scores need to be divided
by two (not shown in the algorithm) if the graph is undirected, as all the
shortest paths are counted twice.
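
To make the level structure concrete, the sketch below runs one outer iteration of the level-synchronous scheme from a single source, sequentially. The loops marked as parallel in the algorithm are ordinary loops here, so no atomic operations are needed; the function name `single_source_dependencies` and the `levels` list (standing in for the stacks S(i)) are assumptions of this illustration.

```python
def single_source_dependencies(adj, k):
    """One outer iteration of the level-synchronous scheme from source k."""
    n = len(adj)
    sigma = [0] * n                 # shortest-path counts
    dist = [-1] * n                 # BFS distances, -1 means unvisited
    preds = [[] for _ in range(n)]  # predecessor multisets P(w)
    sigma[k], dist[k] = 1, 0
    levels = [[k]]                  # levels[i] plays the role of S(i)
    while levels[-1]:
        frontier = []
        for v in levels[-1]:        # parallel loop in the fine-grained algorithm
            for w in adj[v]:        # also parallel in the fine-grained algorithm
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    frontier.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        levels.append(frontier)
    delta = [0.0] * n
    for level in reversed(levels[1:]):   # deepest level first; the source is skipped
        for w in level:
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    return delta, levels

# Diamond: 0 -> 1 -> 3 and 0 -> 2 -> 3, so vertex 3 has two shortest paths from 0.
delta, levels = single_source_dependencies([[1, 2], [3], [3], []], 0)
```

All vertices within one level can be processed concurrently because their distance from the source is identical; in a shared-memory version the updates to `sigma`, `preds`, and `delta` are the places that would need atomic protection.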

There are performance tradeoffs associated with both the coarse-grained and
fine-grained algorithms when implemented on parallel systems. The coarse-grained
algorithm assigns each processor a fraction of the vertices from which to
initiate graph traversal computations. The vertices can be assigned dynamically
to processors, so that work is distributed as evenly as possible. For this
approach, the graph traversal requires no synchronization, and the centrality
metrics can be computed exactly, provided they are accumulated atomically.
Alternatively, each processor can store its partial sum of the centrality score
for every vertex, and all the sums can be merged using an efficient global
reduction operation. However, the problem with the coarse-grained algorithm is
that the auxiliary data structures ($S$, $P$, $\sigma$, $d$, and $\delta$) need
to be replicated on each processor for doing concurrent traversals. The memory
requirements scale as $O(N_P(M + N))$, and this approach becomes infeasible for
large-scale graphs.
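
The partial-sum variant of the coarse-grained approach can be sketched as follows. Each worker receives a share of the source vertices, keeps private auxiliary arrays and a private score vector, and the vectors are summed at the end. The function names are hypothetical, and Python threads are used only to show the structure; under CPython they would not be expected to give a real speedup for this kind of computation.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def partial_betweenness(adj, sources):
    """Private betweenness sums for one worker's share of source vertices."""
    n = len(adj)
    partial = [0.0] * n
    for s in sources:
        dist, sigma = [-1] * n, [0] * n     # auxiliary arrays, private to this worker
        preds = [[] for _ in range(n)]
        dist[s], sigma[s] = 0, 1
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                partial[w] += delta[w]
    return partial

def coarse_grained_betweenness(adj, workers=2):
    n = len(adj)
    chunks = [range(p, n, workers) for p in range(workers)]   # static split of sources
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: partial_betweenness(adj, c), chunks))
    return [sum(column) for column in zip(*partials)]          # global reduction

result = coarse_grained_betweenness([[1], [0, 2], [1]])
```

The per-worker arrays in `partial_betweenness` are exactly the replicated state that makes the memory footprint grow with the number of workers.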

In the fine-grained algorithm, we parallelize each graph traversal, and so
the memory requirement is just $O(M + N)$. We exploit concurrency in the two
stages of each iteration, graph traversal for path discovery and counting
(lines 11 to 23) and dependency accumulation by back-propagation (lines 24 to
30), using the fact that the graph has a low diameter. In previous work, we
presented multithreaded algorithms and efficient implementations for
fine-grained parallel BFS (see [Bader & Madduri 2006a]) and SSSP (see
[Crobak et al. 2007, Madduri et al. 2007]). Similar ideas can be applied to
reduce the synchronization overhead in the access to shared data structures in
the graph traversal phase.

12.3.1 Optimizations for real-world graphs 

The unbalanced degree distribution is another important graph characteristic we 
need to consider while optimizing centrality algorithms. It has been observed that 
real networks tend to have highly skewed degree distributions that, in some cases, 
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can be approximated by power laws [Barabasi & R. Albert 2007, Newman 2UUii]. 
We see a significant number of low degree vertices in these networks and a compar- 
atively smaller number of vertices of very high degree (can be as large as 0{N)) 
[DaH'Asta ct al. 2006]. In some networks, centrality is strongly correlated with 
degree [Joong et al. 200j]: intuitively, high degree vertices may have high central- 
ity scores as a significant number of shortest paths pass through them. The de- 
gree distribution has very little effect on the performance of coarse-grained parallel 
centrality algorithms, as we do a full graph traversal on every outer-loop itera- 
tion. Each iteration roughly takes the same time, and even a static distribution 
of work among processors is reasonably well balanced. However, while design- 
ing fine-grained centrality algorithms, we need to explicitly consider unbalanced 
degree distributions. In a level-synchronized parallel BFS in which vertices are 
statically assigned to processors without considering their degree, it is highly prob- 
able that there will be phases with severe work imbalance. For instance, con- 
sider the case in which one processor is assigned a group of low degree vertices 
and another processor has to expand the BFS frontier from a set of high degree 
vertices (say, degree $O(N)$). In the worst case, we will not achieve any parallel 
speedup using this approach. Our fine-grained BFS and shortest path algorithms 
[Bader & Madduri 2006a, Madduri et al. 2007] are designed to be independent of 
degree distribution, and we use these optimized algorithms as the inner routines for 
the fine-grained centrality algorithms. 
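
The imbalance described above can be made concrete with a small sketch. It is an illustration only, not the implementation from the cited papers: the function names and the toy frontier are assumptions, and the edge-balanced split is just one common remedy. The sketch compares the number of edges each of p workers must scan when a BFS frontier is divided by vertex count versus by edge count.

```python
def vertex_block_loads(frontier_degrees, p):
    """Edges scanned per worker when the frontier is cut into p equal-size vertex blocks."""
    n = len(frontier_degrees)
    bounds = [(n * r) // p for r in range(p + 1)]
    return [sum(frontier_degrees[bounds[r]:bounds[r + 1]]) for r in range(p)]


def edge_balanced_loads(frontier_degrees, p):
    """Edges scanned per worker when the concatenated adjacency lists are cut
    into p nearly equal chunks (a high-degree vertex may be shared by workers)."""
    total = sum(frontier_degrees)
    bounds = [(total * r) // p for r in range(p + 1)]
    return [bounds[r + 1] - bounds[r] for r in range(p)]


# Toy frontier (hypothetical): one hub of degree 1000 and 999 vertices of degree 1.
degrees = [1000] + [1] * 999
static_loads = vertex_block_loads(degrees, 4)
balanced_loads = edge_balanced_loads(degrees, 4)
```

With the vertex-count split, the worker that owns the hub scans about five times as many edges as any other worker, so the phase takes as long as that one worker needs.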

There are several other optimizations that are applicable to real-world net- 
works. If a network is composed of several large disjoint subgraphs, we can run the 
linear-time connected components algorithm to preprocess the network and iden- 
tify the components. The centrality indices of the various components can then be 
evaluated concurrently. Similarly, we can decompose a directed network into its 
strongly connected components. 
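
The preprocessing step for undirected networks can be sketched as follows. This is a minimal sequential version under assumed conventions (adjacency lists indexed from 0, hypothetical function names); the text does not specify which connected components algorithm is used. Each returned group can then be handed to a separate centrality computation.

```python
from collections import deque


def connected_components(adj):
    """Label every vertex of an undirected graph with a component id using BFS.
    Runs in O(M + N) time for adjacency lists."""
    label = [-1] * len(adj)
    count = 0
    for root in range(len(adj)):
        if label[root] != -1:
            continue
        label[root] = count
        queue = deque([root])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if label[w] == -1:
                    label[w] = count
                    queue.append(w)
        count += 1
    return label, count


def split_by_component(adj):
    """Group vertices by component; no shortest path crosses between groups."""
    label, count = connected_components(adj)
    groups = [[] for _ in range(count)]
    for v, c in enumerate(label):
        groups[c].append(v)
    return groups
```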

Observe that, by definition, the betweenness centrality score of a degree-1 
vertex is zero (Figure 12.1). Also, we show that it is not necessary to traverse 
the graph from a degree-1 vertex if we already have the shortest path tree from 
its adjacent vertex. For undirected networks, Algorithm 12.2 gives the modified 
pseudocode for the dependency accumulation stage (lines 25-30 of Algorithm 12.1) 
for a traversal from a vertex $k$ that is adjacent to $n_1$ degree-1 vertices. With 
this revision, we can skip the iteration from each degree-1 vertex. Instead, we 
increment the dependency score of all other vertices in the traversal from its 
adjacent vertex (see line 5). Each degree-1 vertex contributes an increase of 
$\delta(w)$ to the centrality score of $w$, and hence we add the term $n_1 \delta(w)$. 
In addition, we need to increment the centrality score of the source vertex ($k$ 
in this case) in this traversal itself. Line 7 of the routine gives the required 
increment. This optimization is particularly effective in networks such as the 
web graph and protein-interaction networks, in which a significant percentage of 
the vertices have degree 1 (unannotated proteins with few interactions; web pages 
linking to a hub site). In the case of directed networks, the increment values 
(lines 5 and 7) differ depending on the direction of the edges, but the same idea 
is applicable. 
Figure 12.1. Betweenness centrality definition. 

The betweenness centrality index of a degree-1 vertex is 0. We need not 
traverse the graph from a degree-1 vertex if we store the shortest path 
tree from its adjacency. 



Algorithm 12.2. Betweenness centrality dependency accumulation. 

Pseudocode for the parallel betweenness centrality algorithm dependency 
accumulation stage (replacement for lines 25-30 in Algorithm 12.1) for a graph 
traversal from vertex $k$. $k$ has $n_1 > 0$ adjacencies whose degree is 1. 

$b$ = ParallelBCAccumDeg1Adj($k$, $i$, $S$, $\sigma$, $P$, $\delta$) 

1  while $i > 0$ 
2      do for $w \in S(i)$ in parallel 
3          do for $v \in P(w)$ 
4              do $\delta(v)$ += $\frac{\sigma(v)}{\sigma(w)}(1 + \delta(w))$ 
5          $b(w)$ += $(n_1 + 1)\,\delta(w)$ 
6      $i$ -= 1 
7  $b(k)$ += $n_1(\delta(k) - 1)$ 
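
The degree-1 shortcut can be checked against a plain traversal from every vertex. The sketch below is a sequential illustration for undirected, unweighted graphs, not the parallel routine of Algorithm 12.2. The function name, the `skip_degree_one` flag, and the adjacency-list input are assumptions, and scores are summed over ordered (s, t) pairs.

```python
from collections import deque


def betweenness(adj, skip_degree_one=False):
    """Brandes-style betweenness centrality over ordered (s, t) pairs.
    With skip_degree_one=True, degree-1 vertices are never used as sources;
    their contribution is added during the traversal from their neighbor."""
    n = len(adj)
    bc = [0.0] * n
    for k in range(n):
        if skip_degree_one and len(adj[k]) == 1:
            continue  # accounted for in the traversal from its only neighbor
        # n1 = number of degree-1 neighbors of the source k (0 if shortcut is off)
        n1 = sum(1 for u in adj[k] if len(adj[u]) == 1) if skip_degree_one else 0
        dist = [-1] * n
        sigma = [0] * n
        preds = [[] for _ in range(n)]
        dist[k], sigma[k] = 0, 1
        order, queue = [], deque([k])
        while queue:  # BFS: path discovery and counting
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):  # dependency accumulation, farthest level first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != k:
                bc[w] += (n1 + 1) * delta[w]  # line 5 of Algorithm 12.2
        bc[k] += n1 * (delta[k] - 1)          # line 7 of Algorithm 12.2
    return bc
```

On a graph with many degree-1 vertices, the shortcut removes one full traversal per such vertex while producing the same scores as the unmodified loop.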

12.4 Performance results and analysis 
12.4.1 Experimental setup 

In this section, we discuss parallel performance results of closeness, stress, 
and betweenness centrality. We implement both the fine-grained and coarse- 
grained approaches for betweenness. In addition, we optimize our implementa- 
tions for three different shared memory architectures: multicore systems, symmetric 
multiprocessors (SMPs), and massively multithreaded architectures. The 
multithreaded implementation is optimized for the Cray MTA-2 (see, for instance, 
[Bader & Madduri 2006a, Bader & Madduri 2006b]). The MTA-2 is a high-end 
shared memory system offering two unique features that aid considerably in the 
design of irregular algorithms: fine-grained parallelism and low-overhead word-level 
synchronization. It has no data cache; rather than using a memory hierarchy to 
reduce latency, the MTA-2 processors use hardware multithreading to tolerate the 
latency. The word-level synchronization support complements multithreading and 
makes performance primarily a function of parallelism. Since graph algorithms have 
an abundance of parallelism, yet often are not amenable to partitioning, the MTA-2 
architectural features lead to superior performance and scalability. 

For computing centrality metrics on weighted graphs, we use a fine-grained 
parallel SSSP algorithm [Madduri et al. 2007] as the inner routine for graph 
traversal. However, the centrality accumulation step cannot be easily parallelized, as the 
concurrency is dependent on the weight distribution. The coarse-grained algorithm 
is straightforward and just requires minor changes to Algorithm 12.1. 

We report multithreaded performance results on a 40-processor Cray MTA-2 
system with 160 GB uniform shared memory. Each processor has a clock speed of 
220 MHz and support for 128 hardware threads. The code is written in C with MTA- 
2 specific pragmas and directives for parallelization. We compile the code using the 
MTA-2 C compiler (Cray Programming Environment [PE] 2.0.3). The MTA-2 code 
also compiles and runs on sequential processors without any modification. 

Our test platform for the SMP implementations is an IBM Power 570 server. 
The IBM Power 570 is a symmetric multiprocessor with 16 1.9 GHz Power5 cores 
with simultaneous multithreading (SMT), 32 MB shared L3 cache, and 256 GB 
shared memory. The code is written in C with OpenMP directives for parallelization 
and compiled using the IBM XL C compiler v7.0. 

We report multicore system performance results on the Sun Fire T2000 server, 
with the Sun UltraSPARC T1 (Niagara) processor. This system has eight cores 
running at 1.0 GHz, each of which is four-way multithreaded. There are eight 
integer units with a six-stage pipeline on chip, and four threads running on a core 
share the pipeline. The cores also share a 3 MB L2 cache, and the system has a 
main memory of 16 GB. There is only one floating-point unit (FPU) for all cores. 
We compile our code with the Sun C compiler v5.8. 



Network data 

We test our centrality metric implementations on a variety of real-world 
graphs, summarized in Table 12.1. We use the Recursive MATrix (R-MAT) (see 
[Chakrabarti et al. 2004]) random graph generation algorithm to generate synthetic 
input data that are representative of real-world networks with a small-world topol- 
ogy. The degree distributions of the Internet Movie Database (IMDB) test graph 
instance are shown in Figure 12.2. We observe that the degree distributions of most 
of the networks are unbalanced with heavy tails. This observation is in agreement 
with prior experimental studies. 
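
For readers unfamiliar with R-MAT, the generator can be sketched as below. R-MAT places each edge by recursively choosing one of the four quadrants of the adjacency matrix with probabilities a, b, c, and d. The function name, the seed handling, and the default probabilities are illustrative assumptions; the parameters used for the experiments in this chapter are not stated here.

```python
import random


def rmat_edges(scale, num_edges, a=0.57, b=0.19, c=0.19, d=0.05, seed=0):
    """Generate directed edges on 2**scale vertices by recursive quadrant selection.
    (a, b, c, d) are the quadrant probabilities; the defaults are assumptions.
    Duplicate edges and self-loops may occur and are not filtered."""
    assert abs(a + b + c + d - 1.0) < 1e-9
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _level in range(scale):
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass          # top-left quadrant: both new bits stay 0
            elif r < a + b:
                col |= 1      # top-right quadrant
            elif r < a + b + c:
                row |= 1      # bottom-left quadrant
            else:
                row |= 1      # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges
```

Skewing the probabilities toward one quadrant concentrates edges on a few vertices, which is what produces the heavy-tailed degree distributions mentioned above.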
Table 12.1. Networks used in the centrality analysis. 

Data set | Source | Network description 
ND-actor | [Barabasi 2007] | An undirected graph of 392,400 vertices (movie actors) and 31,788,592 edges. An edge corresponds to a link between two actors, if they have acted together in a movie. The data set includes actor listings from 127,823 movies. 
ND-web | [Barabasi 2007] | A directed network with 325,729 vertices and 1,497,135 arcs. Each vertex represents a web page within the Univ. of Notre Dame nd.edu domain, and the arcs represent from → to links. 
ND-yeast | [Barabasi 2007] | Undirected network with 2114 vertices and 2277 edges. Vertices represent proteins, and the edges represent interactions between them in the yeast network. 
UMD-human | [Batagelj & A. Mrvar 2006] | Undirected network with 18,669 vertices and 43,568 edges. Vertices represent proteins, and the edges represent interactions between them in the human interactome. 
PAJ-patent | [Batagelj & A. Mrvar 2006] | A network of about 3 million U.S. patents granted between January 1963 and December 1999, and 16 million citations made among them between 1975 and 1999. 
PAJ-cite | [Batagelj & A. Mrvar 2006] | The Lederberg citation data set, produced using HistCite, in PAJEK graph format with 8843 vertices and 41,609 edges. 


12.4.2 Performance results 

Figure 12.3 compares the single-processor execution time of closeness, betweenness, 
and stress centrality for three networks of different sizes, on the MTA-2 and the 
Power 570. All three metrics are of the same computational complexity and exhibit 
similar running times in practice. 

The MTA-2 performance results are for the fine-grained centrality implementations. 
On SMPs, the coarse-grained version outperforms the fine-grained algorithm 
on current systems due to the parallelization and synchronization overhead involved 
in the fine-grained version. We only have a modest number of processors on current 
SMP systems, so each processor can run a concurrent shortest path computation 
and create auxiliary data structures. On our target system, the Power 570, we can 
compute centrality metrics for graphs with up to 100 million edges by using this 
coarse-grained implementation. 

Figure 12.2. Vertex degree distribution of the IMDB movie-actor 
network (392,400 vertices and 31,788,592 edges). The horizontal axis is the vertex degree. 

Figure 12.4 summarizes multiprocessor execution times for computing be- 
tweenness centrality on the Power 570 and the MTA-2. Figure 12.4(a) gives the 
running times for the ND-actor graph on the Power 570 and the MTA-2. As ex- 
pected, the execution time scales nearly linearly with the number of processors. It 
is possible to evaluate the centrality metric for the entire ND-actor network in 42 
minutes on 16 processors of the Power 570. We observe similar performance for the 
patents citation data. This includes the optimizations for undirected, unweighted 
real-world networks discussed in Section 12.3.1. 

Figures 12.4(c) and 12.4(d) plot the execution time on the MTA-2 and the 
Power 570 for ND-web, and a synthetic graph instance of the same size generated 
using the R-MAT algorithm, respectively. Note that the actual execution time is 
dependent on the graph structure; for the same problem size, the synthetic graph 
instance takes much longer than the ND-web graph. The web crawl is a directed 
network, and splitting the network into its strongly connected components and the 
degree-1 optimization helps in significantly reducing the execution time. 
Figure 12.3. Single-processor comparison. 

Single-processor execution time comparison of the centrality metric im- 
plementations on the IBM Power 570 (top) and the Cray MTA-2 (bot- 
tom). 



12.5 Case study: Betweenness applied to 
protein-interaction networks 

We will now apply the betweenness centrality metric to analyze the human 
interactome. Researchers have paid particular attention to the relation between 
centrality and essentiality or lethality of a protein (for instance, [Jeong et al. 2001]). 
A protein is said to be essential if the organism cannot survive without it. Essential 
proteins can only be determined experimentally, so alternate approaches to predicting 
essentiality are of great interest and have potentially significant applications, such as 
drug target identification [Jeong et al. 2003]. Previous studies on yeast have shown 
that proteins acting as hubs (or high degree vertices) are three times more likely 
to be essential. So we wish to analyze the interplay between degree and centrality 
scores for proteins in the human PIN. We derive our human protein-interaction map 
(referred to as HPIN throughout the chapter) by merging interactions from a human 
proteome analysis data set [Gandhi et al. 2006], a data snapshot from the Human 
Protein Reference Database [Peri et al. 2003], and IntAct [Hermjakob et al. 2004]. 
Each protein is represented by a vertex, and interactions between proteins are 
modeled as edges. This results in a high-confidence protein-interaction network of 18,869 
proteins and 43,568 interactions. 

Figure 12.4. Parallel performance comparison. 

Parallel performance of exact betweenness centrality computation for 
various graph instances on the Power 570 and the MTA-2. 

Figure 12.5. The top 1% proteins. 

HPIN proteins sorted by betweenness centrality (BC) scores and the 
number of interactions. 

Figure 12.5 plots the betweenness centrality scores of the top 1% (about 100) 
proteins in two lists, one ordered by degree and the other by the betweenness 
centrality score. We observe that there is a strong correlation between the degree 
and betweenness centrality score: about 65% of the proteins are common to both 
lists. The protein with the highest degree in the graph also has the highest 
centrality score. This protein (Solute carrier family 2 member 4, Gene Symbol SLC2A4, 
HPRD ID 00688) belongs to the transport/cargo protein molecular class, and its 
primary biological function is transport. From Figure 12.5, it should also be noted 
that the top 1% proteins by degree show a significant variation in betweenness cen- 
trality scores. The scores vary by over four orders of magnitude, from 10~^ to 10~^. 

We next study the correlation of degree with betweenness centrality. Unlike 
degree, which ranges from 1 to 822, the values of betweenness centrality range over 
several orders of magnitude. The few highly connected (or high degree) vertices 
have high betweenness values, as there are many vertices directly and exclusively 
connected to these hubs. Thus, most of the shortest paths between these vertices 
go through these hubs. However, the low-connectivity vertices show a significant 
variation in betweenness values, as evidenced in Figure 12.6 (top). They exhibit a 
variation in betweenness values of up to four orders of magnitude. The high 
betweenness scores may suggest that these proteins are globally important. Interestingly, 
these vertices are completely absent in synthetically generated graphs designed to 
explain scale-free behavior (observe the variation of betweenness centrality scores 
among low degree vertices in Figure 12.6 (bottom)). 

Our observations are further corroborated by two recent results. As the yeast 
PIN has been comprehensively mapped, lethal proteins in the network have been 
identified. Gandhi et al. [Gandhi et al. 2006] demonstrated from an independent 
analysis that the relative frequency of a gene to occur as an essential one is higher 
in the yeast network than in the HPIN. They also observe that the lethality of a 
gene could not be confidently predicted on the basis of the number of interaction 
partners. Joy et al. [Joy et al. 2005] confirmed that proteins with high betweenness 
scores are more likely to be essential and that there are a significant number of 
high-betweenness, low-interaction proteins in the yeast PIN. For a detailed discussion of 
our analysis of HPIN, see [Bader & Madduri 2008]. 

Figure 12.6. Normalized HPIN betweenness centrality. 

Normalized betweenness centrality scores as a function of the degree for 
HPIN (top) and a synthetic scale-free graph instance (bottom). 
Figure 12.7. Betweenness centrality performance. 

Betweenness centrality execution time and speedup on the Sun Fire T2000 system. 



Figure 12.7 plots the execution time and relative speedup achieved on the Sun 
Fire T2000 for computing the betweenness centrality on HPIN. The performance 
scales nearly linearly up to 16 threads, but plateaus between 16 and 32 threads. This 
can be attributed to a saturation of memory bandwidth with 16 threads of execution, 
as well as the presence of only one floating-point unit on the entire chip. We use 
the floating-point unit for accumulating pair dependencies and centrality values. 

12.6 Integer torus: Betweenness conjecture 

Betweenness centrality is a key kernel in the DARPA HPCS SSCA#2 (see 
[Bader et al. 2006]), a benchmark extensively used to evaluate the performance of 
emerging high performance computing architectures for graph analytics. Parallel 
implementations of multithreaded graph algorithms are often prone to program- 
ming error and are computationally expensive to validate; since the integer torus is 
a regular network that is easy to generate, we propose using it as a test instance 
for the SSCA#2 benchmark. A simple validation check for the benchmark would 
be to compare the computed betweenness scores generated from a computational 
routine with the exact analytical expression derived in this section. 

For $n \in \mathbb{N}$, let $T_n$ denote an integer torus, that is, the two-dimensional 
integer lattice mod $n$. Based on empirical evidence from extensive computational 
experimentation, we have the following. 

Conjecture 1. For $v \in T_n$, 

\[ BC(v) = \frac{n^3}{2} - n^2 - \frac{n}{2} + 1 \quad \text{when $n$ is odd} \tag{12.3} \] 

and 

\[ BC(v) = \frac{n^3}{2} - n^2 + 1 \quad \text{when $n$ is even} \tag{12.4} \] 
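
The validation check proposed in this section can be sketched as follows. The code builds the torus, computes betweenness by brute force in exact rational arithmetic, and compares every vertex against the closed forms $n^3/2 - n^2 - n/2 + 1$ (odd $n$) and $n^3/2 - n^2 + 1$ (even $n$). The function names are hypothetical, scores are summed over ordered (s, t) pairs, and $n \ge 3$ is assumed so that each vertex has four distinct neighbors.

```python
from collections import deque
from fractions import Fraction


def torus_adjacency(n):
    """Adjacency lists of the n-by-n integer torus; vertex (x, y) has index x*n + y."""
    def idx(x, y):
        return (x % n) * n + (y % n)
    return [[idx(x + 1, y), idx(x - 1, y), idx(x, y + 1), idx(x, y - 1)]
            for x in range(n) for y in range(n)]


def betweenness_exact(adj):
    """Brandes-style betweenness over ordered (s, t) pairs using exact fractions."""
    n = len(adj)
    bc = [Fraction(0)] * n
    for s in range(n):
        dist, sigma = [-1] * n, [0] * n
        preds = [[] for _ in range(n)]
        dist[s], sigma[s] = 0, 1
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [Fraction(0)] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += Fraction(sigma[v], sigma[w]) * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


def conjectured_bc(n):
    """Closed-form score from Conjecture 1 for any vertex of the torus."""
    if n % 2 == 1:
        return Fraction(n ** 3, 2) - n ** 2 - Fraction(n, 2) + 1
    return Fraction(n ** 3, 2) - n ** 2 + 1
```

A parallel implementation could be validated the same way: run it on a torus and compare each score with `conjectured_bc(n)`, allowing for floating-point tolerance and for any difference in how ordered and unordered pairs are counted.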

There is a parity dependence because of the impact of geodesics whose horizontal 
and/or vertical distance is maximal when $n$ is even. For $s, t \in T_n$ with $s = (x, y)$ 
and $t = (x', y')$, let 

\[ d_H(s,t) = \min\{(x' - x) \bmod n,\ (x - x') \bmod n\} \] 

\[ d_V(s,t) = \min\{(y' - y) \bmod n,\ (y - y') \bmod n\} \] 

\[ d(s,t) = d_H(s,t) + d_V(s,t) \] 

denote the horizontal, vertical, and total distance, respectively, from $s$ to $t$. Note 
that $0 \le d_H(s,t), d_V(s,t) \le \frac{n}{2}$. When $n$ is even and $d_H(s,t) = \frac{n}{2}$, then we say that 
$s$ and $t$ are horizontal diameter achieving. Similarly, when $d_V(s,t) = \frac{n}{2}$, then $s$ and 
$t$ are said to be vertical diameter achieving. 

For $0 \le d_H(s,t), d_V(s,t) < \frac{n}{2}$, we have that 

\[ \sigma_{st} = \binom{d_H(s,t) + d_V(s,t)}{d_V(s,t)} = \binom{d_H(s,t) + d_V(s,t)}{d_H(s,t)} \] 

If $n$ is even and $s$ and $t$ are either horizontal or (exclusively) vertical diameter 
achieving, then 

\[ \sigma_{st} = 2\binom{d_H(s,t) + d_V(s,t)}{d_V(s,t)} \] 

If $n$ is even and $s$ and $t$ are both horizontal and vertical diameter achieving, then 

\[ \sigma_{st} = 4\binom{d_H(s,t) + d_V(s,t)}{d_V(s,t)} \] 
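
The three cases for the number of geodesics can be checked numerically. The sketch below compares the closed form (a binomial coefficient, doubled once for each axis on which the pair is diameter achieving) with shortest-path counts from a breadth-first search on the torus. Function names are hypothetical and $n \ge 3$ is assumed.

```python
from collections import deque
from math import comb


def torus_distances(n, s, t):
    """Horizontal and vertical distances between s = (x, y) and t = (x', y') mod n."""
    dh = min((t[0] - s[0]) % n, (s[0] - t[0]) % n)
    dv = min((t[1] - s[1]) % n, (s[1] - t[1]) % n)
    return dh, dv


def sigma_closed_form(n, s, t):
    """Number of shortest s-t paths according to the binomial formulas."""
    dh, dv = torus_distances(n, s, t)
    count = comb(dh + dv, dv)
    if n % 2 == 0 and dh == n // 2:
        count *= 2  # horizontal diameter achieving: two directions around the torus
    if n % 2 == 0 and dv == n // 2:
        count *= 2  # vertical diameter achieving
    return count


def sigma_bfs(n, s):
    """Number of shortest paths from s to every vertex, counted level by level."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        x, y = v
        for w in (((x + 1) % n, y), ((x - 1) % n, y),
                  (x, (y + 1) % n), (x, (y - 1) % n)):
            if w not in dist:
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
    return sigma
```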

Let $v = (p, q)$ for $p, q \in \mathbb{N}$ with $0 \le p, q < n$. Observe that there is a shortest 
path between $s$ and $t$ that passes through $v$ if and only if 

\[ d_H(s,t) = d_H(s,v) + d_H(v,t) \quad \text{and} \quad d_V(s,t) = d_V(s,v) + d_V(v,t) \] 

Since $\sigma_{st}(v) = \sigma_{sv}\,\sigma_{vt}$, we will calculate $BC(v)$ by counting all geodesics from $s$ to 
$v$ and from $v$ to $t$ where $\sigma_{st}(v)$ is nonzero. We exploit the symmetries of $T_n$ and 
enumerate shortest paths through $v_0 = (0, 0)$ for particular subsets of $s, t \in T_n$. For 
$S, T \subseteq T_n$ and $v_0 = (0, 0)$, let 

\[ A(S,T) = \sum_{s \in S,\ t \in T,\ s \ne v_0 \ne t} \delta_{st}(v_0) \] 

Let $m = \lfloor n/2 \rfloor$ and $x, y \in \{-m, \ldots, 0, \ldots, m\}$ for $(x, y) \in T_n$. We divide $T_n$ into four 
quadrants, centered at $v_0$: 

\[ Q_1 = \{(-x, -y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_2 = \{(-x, y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_3 = \{(x, y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_4 = \{(x, -y) \in T_n \mid 0 \le x, y \le m\} \] 

Since exactly which subsets of $T_n$ are used depends on the parity of $n$, we consider 
the two cases separately. 

12.6.1 Proof of conjecture when n is odd 

We first prove equation (12.3) of the conjecture, since there are no diameter 
achieving pairs of vertices when $n$ is odd. 

Lemma 12.1. For $v_0 = (0, 0) \in T_n$ with $n$ odd, 

\[ \begin{aligned} BC(v_0) &= A(Q_1,Q_3) + A(Q_3,Q_1) + A(Q_2,Q_4) + A(Q_4,Q_2) \\ &\quad - A(Q_1,Q_2) - A(Q_3,Q_4) - A(Q_4,Q_1) - A(Q_2,Q_3) \\ &= 4 \cdot A(Q_1,Q_3) - 4 \cdot A(Q_1,Q_2) \end{aligned} \] 

Proof. In the calculation of $BC(v_0)$, we sum over all possible pairs $s$ and $t$, where 
a path from $s$ to $t$ is counted as distinct from the reversed path that begins at $t$ and 
ends at $s$. Thus, the first four terms count all possible paths from $s$ to $t$ along the 
four "diagonals" through $v_0$, with redundancy. By the symmetries of $T_n$, these four 
values are all equal to $A(Q_1,Q_3)$, which counts the fraction of shortest paths through 
$v_0$ from an $s \in Q_1$, the "lower-left" quadrant (modulo $n$) with respect to $(0,0)$, to 
$t$ in the upper-right quadrant $Q_3$. The remaining additive terms then correct for 
the overcounting of geodesics along the vertical and horizontal lines, respectively, 
through $v_0$. As before, these are all equal to $A(Q_1,Q_2)$, where $\sigma_{st}(v_0) \ne 0$ if and 
only if $x = 0 = x'$ for $(x, y) \in Q_1$ and $(x', y') \in Q_2$. 

Note that each of the paths where one of $s$ and $t$ lies on the horizontal line 
through $v_0$ and the other lies on the vertical line through $v_0$ is counted exactly 
twice (once in each direction) in $4 \cdot A(Q_1,Q_2)$. For instance, $A(Q_1,Q_2)$ counts 
paths where $s$ lies on the horizontal line to the left of $v_0$ (again modulo $n$) or on 
the vertical line below $v_0$ and where $t$ lies on the horizontal line to the right of $v_0$ 
or on the vertical line above $v_0$. □ 

Theorem 12.2. Let $n$ be an odd integer. Suppose $v_0 = (0,0) \in T_n$ and $m = \lfloor n/2 \rfloor$, 
with 

\[ Q_1 = \{(-x, -y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_2 = \{(-x, y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_3 = \{(x, y) \in T_n \mid 0 \le x, y \le m\} \] 

Then 

\[ A(Q_1,Q_2) = \frac{m(m-1)}{2} \tag{12.5} \] 

and 

\[ A(Q_1,Q_3) = 1 + (m-1)(m+1)^2 \tag{12.6} \] 
Proof. Let $v_0 = (0,0) \in T_n$ for an odd integer $n$. Consider $s, t \in T_n$ where 
$s \ne v_0 \ne t$ and $v_0$ lies on a shortest path from $s$ to $t$. Let $m = \lfloor n/2 \rfloor$ and 

\[ d_H(s,t) = h, \quad d_V(s,t) = k \] 
\[ d_H(s,v_0) = i, \quad d_V(s,v_0) = j \] 
\[ d_H(v_0,t) = i' = h - i, \quad d_V(v_0,t) = j' = k - j \] 

for $0 \le i \le h \le m < \frac{n}{2}$ and $0 \le j \le k \le m < \frac{n}{2}$ where not both $i = j = 0$ nor 
$i = h$, $j = k$ nor $h = k = 0$. In this notation, we have that 

\[ \delta_{st}(v_0) = \frac{\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{\binom{h+k}{k}} \] 

Suppose further that $s \in Q_1$ and $t \in Q_2$. Since $\sigma_{st}(v_0) = 0$ unless $x = 0 = x'$, 
we need only consider $i = 0$. Since $j = 0$ implies that $s = v_0$ and $j = k$ implies that 
$t = v_0$, we consider only $0 < j < k$. When $i = 0$ and $0 < j < k \le m < \frac{n}{2}$, then 
$\sigma_{st} = 1$ and $\sigma_{st}(v_0) = 1$. Thus, since $i = 0$ and $h = 0$, 

\[ A(Q_1,Q_2) = \sum_{k=2}^{m} \sum_{j=1}^{k-1} \frac{\binom{0+j}{0}\binom{0+(k-j)}{0}}{\binom{0+k}{0}} = \frac{m(m-1)}{2} \] 

where we begin the outer summation at $k = 2$ since $h = 0$, $k = 0$ implies that 
$s = v_0 = t$ and $h = 0$, $k = 1$ implies that either $s = v_0$ or $t = v_0$. Hence, 
equation (12.5) in Theorem 12.2 holds. 

Suppose now that $s \in Q_1$ and $t \in Q_3$. As with our previous equality, we 
enumerate $A(Q_1,Q_3)$ by summing over the possible values of $0 \le i \le h \le m < \frac{n}{2}$ 
and $0 \le j \le k \le m < \frac{n}{2}$ where not both $i = j = 0$ nor $i = h$, $j = k$ nor $h = k = 0$. 
Also, as before, if $h = 0$, then $k > 1$ and vice versa. Thus 

\[ A(Q_1,Q_3) = 1 + \sum_{h=0}^{m} \sum_{k=0}^{m} \left( \frac{1}{\binom{h+k}{k}} \left( \sum_{i=0}^{h} \sum_{j=0}^{k} \binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i} - 2\binom{h+k}{k} \right) \right) \] 

where we have corrected for $i = 0 = j$ and $h - i = 0 = k - j$ by subtracting out the 
terms $\binom{0+0}{0}\binom{h+k}{k}$ and $\binom{h+k}{k}\binom{0+0}{0}$. Likewise, we have corrected for $h = 0$, $k = 0$ by 
adding 1. Note that when $h = 0$, $k = 1$ and $h = 1$, $k = 0$, the expression inside the 
$h$ and $k$ summands is zero and no correction is needed. We know that 

\[ \sum_{j=0}^{k} \binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i} = \binom{h+k+1}{k} \] 

either as an application of equation (5.26) from [Graham et al. 1989] (found via 
identity #3100005 on the Pascal's Triangle website*) or proved directly via induction 
and parallel summation. Hence, we have that 

\[ \begin{aligned} A(Q_1,Q_3) &= 1 + \sum_{h=0}^{m} \sum_{k=0}^{m} \left( -2 + \frac{1}{\binom{h+k}{k}} \sum_{i=0}^{h} \binom{h+k+1}{k} \right) \\ &= 1 + \sum_{h=0}^{m} \sum_{k=0}^{m} \left( -2 + \frac{h+1}{\binom{h+k}{k}} \cdot \frac{h+k+1}{h+1} \binom{h+k}{k} \right) \\ &= 1 + \sum_{h=0}^{m} \sum_{k=0}^{m} (h + k - 1) \\ &= 1 + (m-1)(m+1)^2 \end{aligned} \] 

and equation (12.6) of Theorem 12.2 also holds. □ 

*http://binomial.csuhayward.edu/. 
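
The summations in this proof are easy to check numerically. The sketch below uses hypothetical function names and takes the summand to be the product of binomial coefficients $\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}$ divided by $\binom{h+k}{k}$. It verifies the single-sum identity, the closed forms (12.5) and (12.6), and the resulting value of $4 \cdot A(Q_1,Q_3) - 4 \cdot A(Q_1,Q_2)$ for $n = 2m + 1$.

```python
from fractions import Fraction
from math import comb


def inner_sum(h, k, i):
    """Sum over j of C(i+j, i) * C((h-i)+(k-j), h-i); expected to equal C(h+k+1, k)."""
    return sum(comb(i + j, i) * comb((h - i) + (k - j), h - i) for j in range(k + 1))


def a_q1_q3(m):
    """A(Q1, Q3) by direct summation, with the endpoint corrections from the proof."""
    total = Fraction(1)
    for h in range(m + 1):
        for k in range(m + 1):
            through = sum(inner_sum(h, k, i) for i in range(h + 1))
            total += Fraction(through - 2 * comb(h + k, k), comb(h + k, k))
    return total


def a_q1_q2(m):
    """A(Q1, Q2): one unit per pair (j, k) with 1 <= j < k <= m on the vertical line."""
    return sum(1 for k in range(2, m + 1) for j in range(1, k))


def odd_case_bc(m):
    """BC(v0) = 4*A(Q1,Q3) - 4*A(Q1,Q2) for n = 2m + 1."""
    return 4 * a_q1_q3(m) - 4 * a_q1_q2(m)
```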
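Since $m = \frac{n-1}{2}$ and 

\[ BC(v_0) = 4 \cdot A(Q_1,Q_3) - 4 \cdot A(Q_1,Q_2) = BC(v) \quad \text{for all } v \in T_n \] 

as an immediate consequence of Theorem 12.2, we have the following. 

Corollary 12.3. When $n$ is odd, 

\[ BC(v) = \frac{n^3}{2} - n^2 - \frac{n}{2} + 1 \] 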

12.6.2 Proof of conjecture when n is even 

The proof of equation (12.4) of the conjecture is similar, although considerably more 
complicated, when $n$ is even and $m = \frac{n}{2}$. The complications are due to any diameter 
achieving pairs $s, t \in T_n$, which double the number of geodesics when either $d_H(s,t)$ 
or $d_V(s,t) = \frac{n}{2}$ and quadruple it when $s$ and $t$ are both horizontal and vertical 
diameter achieving. 

Again, we let 

\[ Q_1 = \{(-x, -y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_2 = \{(-x, y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_3 = \{(x, y) \in T_n \mid 0 \le x, y \le m\} \] 
\[ Q_4 = \{(x, -y) \in T_n \mid 0 \le x, y \le m\} \] 

Also, for $s, t, v_0 = (0, 0) \in T_n$, we will use the same notation of 

\[ d_H(s,t) = h, \quad d_V(s,t) = k \] 
\[ d_H(s,v_0) = i, \quad d_V(s,v_0) = j \] 
\[ d_H(v_0,t) = i' = h - i, \quad d_V(v_0,t) = j' = k - j \] 

for $0 \le i \le h \le m = \frac{n}{2}$ and $0 \le j \le k \le m = \frac{n}{2}$ where not both $i = j = 0$ nor 
$i = h$, $j = k$ nor $h = k = 0$. 

When $n$ was odd, we were able to compute $BC(v_0)$ as a function only of 
$A(Q_1,Q_3)$ and $A(Q_1,Q_2)$. Now that $n$ is even, the basic approach is the same 
except that we must consider a number of different subcases due to the impact of 
the diameter achieving pairs on the enumeration of 

\[ \delta_{st}(v_0) = \frac{\sigma_{st}(v_0)}{\sigma_{st}} \] 

In our notation, we still have that 

\[ \delta_{st}(v_0) = \frac{\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{\binom{h+k}{k}} \] 

when $0 \le i \le h < m = \frac{n}{2}$, $0 \le j \le k < m = \frac{n}{2}$. If instead $h = m$ but $i, h - i \ne m$, 
then we have 

\[ \delta_{st}(v_0) = \frac{\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{2\binom{h+k}{k}} \] 

By the symmetry between $i$ and $j$, $h$ and $k$, this is also true for $k = m$ and 
$j, k - j \ne m$. When $h = m$ and $i = 0$, $h = m$ and $i = m$, $k = m$ and $j = 0$, or 
$k = m$ and $j = m$, then 

\[ \delta_{st}(v_0) = \frac{2\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{2\binom{h+k}{k}} \] 

When $h = k = m$, then $s$ and $t$ are both horizontal and vertical diameter achieving. 
If $1 \le i \le m - 1$ and $1 \le j \le m - 1$, then 

\[ \delta_{st}(v_0) = \frac{\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{4\binom{h+k}{k}} \] 

If exactly one of $i$, $h - i$, $j$, or $k - j$ is also diameter achieving, then 

\[ \delta_{st}(v_0) = \frac{2\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{4\binom{h+k}{k}} \] 

while if both $i$ and $j$ or both $h - i$ and $k - j$ are diameter achieving, then 

\[ \delta_{st}(v_0) = \frac{4\binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i}}{4\binom{h+k}{k}} \] 

If we enumerate $\delta_{st}(v_0)$ for all these different cases, then we will be able to calculate 
$BC(v_0)$ as we did before. 

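Theorem 12.4. Let $n$ be an even integer. Suppose $v_0 = (0,0) \in T_n$ and $m = \frac{n}{2}$. 
Then 

\[ \begin{aligned} BC(v_0) = {} & 4 \cdot A(S_1,T_1) \\ & + 4 \cdot 2 \cdot A(S_2,T_2) \\ & + 2 \cdot 4 \cdot A(S_3,T_3) \\ & + 4 \cdot A(S_4,T_4) \\ & + 2 \cdot 4 \cdot A(S_5,T_5) \\ & + 2 \cdot A(S_6,T_6) \\ & - 4 \cdot A(S_7,T_7) \end{aligned} \] 

where, for $d_H(s,v_0) = i$, $d_V(s,v_0) = j$, $d_H(s,t) = h$, and $d_V(s,t) = k$, 

\[ \begin{aligned} S_1 &= \{ s \in Q_1 \mid 0 \le i \le h \le m-1,\ 0 \le j \le k \le m-1 \} \\ T_1 &= \{ t \in Q_3 \mid 0 \le h-i \le h \le m-1,\ 0 \le k-j \le k \le m-1 \} \\ S_2 &= \{ s \in Q_1 \mid 1 \le i < h = m,\ 0 \le j \le k \le m-1 \} \\ T_2 &= \{ t \in Q_3 \mid 1 \le h-i < h = m,\ 0 \le k-j \le k \le m-1 \} \\ S_3 &= \{ s \in Q_1 \mid i = 0,\ 0 \le j \le k \le m-1 \} \\ T_3 &= \{ t \in Q_3 \mid h-i = m,\ 0 \le k-j \le k \le m-1 \} \\ S_4 &= \{ s \in Q_1 \mid 1 \le i < h = m,\ 1 \le j < k = m \} \\ T_4 &= \{ t \in Q_3 \mid 1 \le h-i < h = m,\ 1 \le k-j < k = m \} \\ S_5 &= \{ s \in Q_1 \mid i = 0,\ 1 \le j < k = m \} \\ T_5 &= \{ t \in Q_3 \mid h-i = m,\ 1 \le k-j < k = m \} \\ S_6 &= \{ s \in Q_1 \mid i = 0,\ j = m \} \\ T_6 &= \{ t \in Q_3 \mid h-i = m,\ k-j = 0 \} \\ S_7 &= \{ s \in Q_1 \mid i = 0,\ 1 \le j < k \le m \} \\ T_7 &= \{ t \in Q_3 \mid h-i = 0,\ 1 \le k-j < k \le m \} \end{aligned} \] 

and in each case not both $i = j = 0$ nor $i = h$, $j = k$ nor $h = k = 0$. 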
Proof. The different pairs $S_i, T_i$ for $1 \le i \le 7$ correspond to the following cases: 

\[ \begin{array}{lllll} \text{Sets} & h & k & i & j \\ \hline S_1, T_1 & 0 \le h \le m-1 & 0 \le k \le m-1 & 0 \le i \le h & 0 \le j \le k \\ S_2, T_2 & h = m & 0 \le k \le m-1 & 1 \le i \le h-1 & 0 \le j \le k \\ & 0 \le h \le m-1 & k = m & 0 \le i \le h & 1 \le j \le k-1 \\ S_3, T_3 & h = m & 0 \le k \le m-1 & i = 0 & 0 \le j \le k \\ & h = m & 0 \le k \le m-1 & i = m & 0 \le j \le k \\ & 0 \le h \le m-1 & k = m & 0 \le i \le h & j = 0 \\ & 0 \le h \le m-1 & k = m & 0 \le i \le h & j = m \\ S_4, T_4 & h = m & k = m & 1 \le i \le h-1 & 1 \le j \le k-1 \\ S_5, T_5 & h = m & k = m & i = 0 & 1 \le j \le k-1 \\ & h = m & k = m & i = m & 1 \le j \le k-1 \\ & h = m & k = m & 1 \le i \le h-1 & j = 0 \\ & h = m & k = m & 1 \le i \le h-1 & j = m \\ S_6, T_6 & h = m & k = m & i = 0 & j = m \\ S_7, T_7 & h = 0 & 2 \le k \le m & i = 0 & 1 \le j < k \le m \end{array} \] 

For $S_1 \subset Q_1, T_1 \subset Q_3$ and $S_2 \subset Q_1, T_2 \subset Q_3$, there are corresponding distinct sets 
in each of $Q_2, Q_4$ and $Q_3, Q_1$ and $Q_4, Q_2$. This is also true for $S_4, T_4$. For $S_3, T_3$, 
however, we also have that $S_3 \subset Q_4$ and $T_3 \subset Q_2$. Thus, we only multiply $A(S_3, T_3)$ 
by a factor of two to account for the reverse paths from $Q_3 \cap Q_2$ to $Q_1 \cap Q_4$. This 
is also the case for $S_5, T_5$. For $S_6, T_6$, we first note that $i = m$, $j = 0$ gives the same 
path, just in the opposite direction. Since this is the only such path, it is counted 
twice. Finally, $S_7, T_7$ correct for the overcounting along the horizontal and vertical 
lines through $v_0$, once in each direction, for a total factor of four. □ 

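Theorem 12.5. Suppose the assumptions of Theorem 12.4 hold. Then 

\[ A(S_1,T_1) = 1 + (m-2)m^2 \] 

\[ A(S_2,T_2) = \frac{(m-1)m(3m+1)}{4(m+1)} \] 

\[ A(S_3,T_3) = \frac{(m-1)m}{2(m+1)} \] 

\[ A(S_4,T_4) = \frac{1}{4\binom{2m}{m}} \left( (m-3)\binom{2m+1}{m} + 2\binom{2m}{m} + 2 \right) \] 

\[ A(S_5,T_5) = \frac{\binom{m+m+1}{m} - 1 - \binom{2m}{m}}{2\binom{2m}{m}} \] 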
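Proof. The result follows from the following summations, and extensive use of the 
equality on page 275: 

\[ A(S_1,T_1) = 1 + \sum_{h=0}^{m-1} \sum_{k=0}^{m-1} \left( \frac{1}{\binom{h+k}{k}} \left( \sum_{i=0}^{h} \sum_{j=0}^{k} \binom{i+j}{i}\binom{(h-i)+(k-j)}{h-i} - 2\binom{h+k}{k} \right) \right) \] 

\[ A(S_2,T_2) = \sum_{k=0}^{m-1} \frac{1}{2\binom{m+k}{k}} \sum_{i=1}^{m-1} \sum_{j=0}^{k} \binom{i+j}{i}\binom{(m-i)+(k-j)}{m-i} \] 

\[ A(S_3,T_3) = \sum_{k=0}^{m-1} \left( -1 + \frac{1}{\binom{m+k}{k}} \sum_{j=0}^{k} \binom{0+j}{0}\binom{(m-0)+(k-j)}{m-0} \right) \] 

\[ A(S_4,T_4) = \frac{1}{4\binom{m+m}{m}} \sum_{i=1}^{m-1} \sum_{j=1}^{m-1} \binom{i+j}{i}\binom{(m-i)+(m-j)}{m-i} \] 

\[ A(S_5,T_5) = \frac{1}{2\binom{m+m}{m}} \sum_{j=1}^{m-1} \binom{0+j}{0}\binom{(m-0)+(m-j)}{m-0} \] 

\[ A(S_6,T_6) = \frac{1}{\binom{m+m}{m}} \binom{0+m}{0}\binom{(m-0)+(m-m)}{m-0} \] 

\[ A(S_7,T_7) = \sum_{k=2}^{m-1} \sum_{j=1}^{k-1} \frac{\binom{0+j}{0}\binom{0+(k-j)}{0}}{\binom{0+k}{0}} + \frac{1}{2} \sum_{j=1}^{m-1} \frac{\binom{0+j}{0}\binom{0+(m-j)}{0}}{\binom{0+m}{0}} \] 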

As an immediate consequence of Theorems 12.4 and 12.5, we have the following. 

Corollary 12.6. When $n$ is even, 

\[ BC(v) = \frac{n^3}{2} - n^2 + 1 \] 
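
The even case can also be checked numerically. The sketch below evaluates the seven terms by direct summation and combines them as in Theorem 12.4. It makes two assumptions beyond the statements above: the coefficient 2 on $A(S_6,T_6)$ is taken from the remark in the proof that this path is counted twice, and every pair dependency is computed from one rule (a binomial path count, doubled once per axis on which a leg is diameter achieving). All function names are hypothetical, and $m \ge 2$ is assumed.

```python
from fractions import Fraction
from math import comb


def sigma(dh, dv, m):
    """Geodesic count for offsets (dh, dv) on the torus with n = 2m."""
    count = comb(dh + dv, dh)
    if dh == m:
        count *= 2  # horizontal diameter achieving
    if dv == m:
        count *= 2  # vertical diameter achieving
    return count


def delta(i, j, h, k, m):
    """Fraction of s-t geodesics through v0; s is at offset (i, j) from v0
    and t is at offset (h - i, k - j) from v0."""
    return Fraction(sigma(i, j, m) * sigma(h - i, k - j, m), sigma(h, k, m))


def even_case_terms(m):
    """Return [A(S_1,T_1), ..., A(S_7,T_7)] by direct summation, for m >= 2."""
    a1 = sum(delta(i, j, h, k, m)
             for h in range(m) for k in range(m)
             for i in range(h + 1) for j in range(k + 1)
             if (i, j) != (0, 0) and (i, j) != (h, k))
    a2 = sum(delta(i, j, m, k, m)
             for k in range(m) for i in range(1, m) for j in range(k + 1))
    a3 = sum(delta(0, j, m, k, m) for k in range(m) for j in range(1, k + 1))
    a4 = sum(delta(i, j, m, m, m) for i in range(1, m) for j in range(1, m))
    a5 = sum(delta(0, j, m, m, m) for j in range(1, m))
    a6 = delta(0, m, m, m, m)
    a7 = sum(delta(0, j, 0, k, m) for k in range(2, m + 1) for j in range(1, k))
    return [a1, a2, a3, a4, a5, a6, a7]


def even_case_bc(m):
    """Combine the seven terms with the multiplicities of Theorem 12.4."""
    a1, a2, a3, a4, a5, a6, a7 = even_case_terms(m)
    return 4 * a1 + 8 * a2 + 8 * a3 + 4 * a4 + 8 * a5 + 2 * a6 - 4 * a7
```

For $m = 2$ (the 4-by-4 torus) the combination evaluates to 17, which agrees with $n^3/2 - n^2 + 1$ at $n = 4$.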
Chapter 13 

Implementing Sparse 
Matrices for Graph 
Algorithms 



Aydın Buluç*, John Gilbert†, and Viral B. Shah† 



Abstract 

Sparse matrices are a key data structure for implementing graph 
algorithms using linear algebra. This chapter reviews and evaluates storage 
formats for sparse matrices and their impact on primitive operations. 
We present complexity results of these operations on different sparse 
storage formats both in the random access memory (RAM) model and 
in the input/output (I/O) model. RAM complexity results were known 
except for the analysis of sparse matrix indexing. On the other hand, 
most of the I/O complexity results presented are new. The chapter 
focuses on different variations of the triples (coordinates) format and 
the widely used compressed sparse row (CSR) and compressed sparse 
column (CSC) formats. For most primitives, we provide detailed pseu- 
docodes for implementing them on triples and CSR/CSC. 



13.1 Introduction 

The choice of data structure is one of the most important steps in algorithm design 
and implementation. Sparse matrix algorithms are no exception. The representation 
of a sparse matrix not only determines the efficiency of the algorithm, but 
also influences the algorithm design process. Given this bidirectional relationship, 
this chapter reviews and evaluates sparse matrix data structures with key primitive 
operations in mind. In the case of array-based graph algorithms, these primitives 
are sparse matrix vector multiplication (SpMV), sparse general matrix matrix 
multiplication (SpGEMM), sparse matrix reference/assignment (SpRef/SpAsgn), and 
sparse matrix addition (SpAdd). The administrative overheads of different sparse 
matrix data structures, both in terms of storage and processing, are also important 
and are exposed throughout the chapter. 

*High Performance Computing Research, Lawrence Berkeley National Laboratory, 1 Cyclotron 
Road, Berkeley, CA 94720 (abuluc@lbl.gov). 

†Computer Science Department, University of California, Santa Barbara, CA 93106-5110 
(gilbert@cs.ucsb.edu, viral@mayin.org). 

Let $A \in \mathbb{S}^{M \times N}$ be a sparse rectangular matrix of elements from an arbitrary
semiring $\mathbb{S}$. We use nnz(A) to denote the number of nonzero elements in A. When
the matrix is clear from context, we drop the parentheses and simply use nnz. For
sparse matrix indexing, we use the convenient Matlab colon notation, where A(:, i)
denotes the ith column, A(i, :) denotes the ith row, and A(i, j) denotes the element
at the (i, j)th position of matrix A. For one-dimensional arrays, a(i) denotes the
ith component of the array. Indices are 1-based throughout the chapter. We use
flops(A op B) to denote the number of nonzero arithmetic operations required by
the operation A op B. Again, when the operation and the operands are clear from
the context, we simply use flops. To reduce notational overhead, we take each
operation's complexity to be at least one, i.e., we say O(·) instead of O(max(·, 1)).

One of the traditional ways to analyze the computational complexity of a 
sparse matrix operation is by counting the number of floating-point operations per- 
formed. This is similar to analyzing algorithms according to their RAM complexities 
(see [Aho et al. 1974]). As memory hierarchies became dominant in computer
architectures, the I/O complexity (also called the cache complexity) of a given algorithm
became as important as its RAM complexity. Cache performance is especially
important for sparse matrix computations because of their irregular nature and low
ratio of flops to memory access. One approach to hiding the memory-processor
speed gap is to use massively multithreaded architectures [Feo et al. 2005].
However, these architectures have limited availability at present.

In the I/O model, only two levels of memory are considered for simplicity: 
a fast memory and a slow memory. The fast memory is called cache and the 
slow memory is called disk, but the analysis is valid at different levels of memory 
hierarchy with appropriate parameter values. Both levels of memories are parti- 
tioned into blocks of size L, usually called the cache line size. The size of the 
fast memory is denoted by Z. If data needed by the CPU is not found in the 
fast memory, a cache miss occurs, and the memory block containing the needed 
data is fetched from the slow memory. The I/O complexity of an algorithm can be 
roughly defined as the number of memory transfers it makes between the fast and 
slow memories [Aggarwal & Vitter 1988]. The number of memory transfers does
not necessarily mean the number of words moved between fast and slow memories,
since the memory transfers happen in blocks of size L. For example, scanning an
array of size N has I/O complexity N/L. In this chapter, scan(A) = nnz(A)/L
is used as an abbreviation for the I/O complexity of examining all the nonzeros
of matrix A in the order that they are stored. Figure 13.1 shows a simple memory
hierarchy with some typical latency values as of today. Meyer, Sanders, and
Sibeyn provided a contemporary treatment of algorithmic implications of memory
hierarchies [Meyer et al. 2002].

Figure 13.1. Typical memory hierarchy.

Approximate values of memory size and latency (partially adapted from
Hennessy and Patterson [Hennessy & Patterson 2006]) assuming a 2
GHz processing core.
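
To make the distinction between words moved and block transfers concrete, the following minimal Python sketch counts block transfers for a sequence of word addresses. The function name `count_block_transfers`, the fully associative cache, and the LRU replacement policy are assumptions made for illustration; the I/O model itself only fixes the block size L and the fast memory size Z.

```python
from collections import OrderedDict


def count_block_transfers(addresses, L, Z):
    """Count transfers from slow to fast memory for a stream of word addresses.

    L is the block (cache line) size in words and Z is the fast memory size
    in words. The cache is modeled as fully associative with LRU replacement,
    which is an assumption of this sketch, not part of the I/O model.
    """
    capacity = max(1, Z // L)      # number of blocks that fit in fast memory
    cache = OrderedDict()          # block id -> True, ordered by recency
    transfers = 0
    for addr in addresses:
        block = addr // L
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency
        else:
            transfers += 1                 # miss: fetch the whole block
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used block
    return transfers
```

With L = 8 and Z = 64, scanning 1024 consecutive words costs 1024/8 = 128 transfers, while 1024 accesses with a stride of 8 words cost one transfer each, even though the same number of words is used.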



We present the computational complexities of algorithms in the RAM model 
as well as the I/O model. However, instead of trying to come up with the most I/O 
efficient implementations, we analyze the I/O complexities of the most widely used
implementations, which are usually motivated by the RAM model. There are two 
reasons for this approach. First, I/O optimality is still an open problem for some of 
the key primitives presented in this chapter. Second, I/O efficient implementations 
of some key primitives turn out to be suboptimal in the RAM model with respect 
to the amount of work they do. 

Now, we state two crucial assumptions that are used throughout this chapter. 

Assumption 1. A sparse matrix with dimensions M × N has nnz ≥ M, N. More
formally, nnz = Ω(N, M).

Assumption 1 simplifies the asymptotic analysis of the algorithms presented
in this chapter. It implies that when both the order of the matrix and its number
of nonzeros are included as terms in the asymptotic complexity, only the nnz
term survives. While this assumption is common in numerical linear algebra (it is
one of the requirements for nonsingularity), in some parallel graph computations it
may not hold. In this chapter, however, we use this assumption in our analysis. In
contrast, Buluç and Gilbert gave an SpGEMM algorithm specifically designed for
hypersparse matrices (matrices with nnz < N, M) [Buluç & Gilbert 2008b].

Assumption 2. The fast memory is not big enough to hold data structures of
O(N) size, where N is the matrix dimension.

We argue that Assumption 2 is justified when the fast memory under consideration
is either the L1 or L2 cache. Out-of-order CPUs can generally hide memory
latencies from L1 cache misses, but not L2 cache misses [Hennessy & Patterson 2006].
Therefore, it is more reasonable to treat the L2 cache as the fast memory and
RAM (main memory) as the slow memory. The largest sparse matrix that fills the
whole machine RAM (assuming the triples representation, which occupies 16 bytes per
nonzero) has $2^{30}/16 = 2^{26}$ nonzeros per GB of RAM. A square sparse matrix with
an average of eight nonzeros per column has dimensions $2^{23} \times 2^{23}$ per GB of RAM.
A single dense vector of double-precision floating-point numbers with $2^{23}$ elements
would require 64 MB of L2 cache memory per GB of RAM, which is much
larger than the typical size of the L2 cache.
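
The arithmetic behind this argument can be checked directly. The sketch below only restates the numbers used in the paragraph above (16 bytes per nonzero, eight nonzeros per column, 8-byte doubles); the variable names are illustrative.

```python
GB = 2**30                   # bytes in one GB of RAM
BYTES_PER_NONZERO = 16       # triples storage cost assumed in the text
NONZEROS_PER_COLUMN = 8      # average column density assumed in the text
BYTES_PER_DOUBLE = 8         # double-precision floating-point number

# Number of nonzeros that fit in one GB of RAM.
nonzeros_per_gb = GB // BYTES_PER_NONZERO

# Dimension of a square matrix with that many nonzeros.
dimension = nonzeros_per_gb // NONZEROS_PER_COLUMN

# Size of one dense vector of that dimension, in MB.
dense_vector_mb = dimension * BYTES_PER_DOUBLE // 2**20
```

Here `nonzeros_per_gb` is 2^26, `dimension` is 2^23, and `dense_vector_mb` is 64, matching the figures quoted above.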

The goal of this chapter is to explain sparse matrix data structures progres- 
sively, starting from the least structured and most simple format (unordered triples) 
and ending with the most structured formats: compressed sparse row (CSR) and 
compressed sparse column (CSC). This way, we motivate why
experts prefer the CSR/CSC formats by comparing and contrasting them with
simpler formats. For example, CSR, a dense collection of sparse row arrays, can
also be viewed as an extension of the triples format enhanced with row indexing
capabilities. Furthermore, many ideas and intermediate data structures that are used
to implement key primitives on triples are also widely used with implementations
on CSR/CSC formats.

A vast amount of literature exists on sparse matrix storage schemes. Other
specialized data structures that are worth mentioning include

• Blocked compressed stripe formats (BCSR and BCSC) use less bandwidth to 
accelerate bandwidth limited computations such as SpMV. 

• Knuth storage allows fast access to both rows and columns at the same time, 
and it makes dynamic changes to the matrix possible. Therefore, it is very 
suitable for all kinds of SpRef and SpAsgn operations. Its drawback is its 
excessive memory usage (5 nnz + 2M) and high cache miss ratio.

• Hierarchical storage schemes such as quadtrees [Samet 1984, Wise & Franco 1990] 
are theoretically attractive, but achieving good performance in practice re- 
quires careful algorithm engineering to avoid high cache miss ratios that would 
result from straightforward pointer-based implementations. 

The rest of this chapter is organized as follows. Section 13.2 describes the key 
sparse matrix primitives. Section 13.3 reviews the triples/coordinates representa- 
tion, which is natural and easy to understand. The triples representation generalizes 
to higher dimensions [Bader & Kolda 2007]. Its resemblance to database tables will
help us expose some interesting connections between databases and sparse matri- 
ces. Section 13.4 reviews the most commonly used compressed storage formats 
for general sparse matrices, namely CSR and CSC. Section 13.5 presents a case
study of how sparse matrices are represented in the Star-P programming environment.
Section 13.6 concludes the chapter.



13.2 Key primitives

Most of the sparse matrix operations have been motivated by numerical linear 
algebra. Some of them are also useful for graph algorithms: 

1. Sparse matrix indexing and assignment (SpRef/SpAsgn): Corresponds to sub- 
graph selection. 

2. Sparse matrix-dense vector multiplication (SpMV): Corresponds to breadth-first
or depth-first search.

3. Sparse matrix addition and other pointwise operations (SpAdd): Corresponds
to graph merging.

4. Sparse matrix-sparse matrix multiplication (SpGEMM): Corresponds to
breadth-first or depth-first search to/from multiple vertices simultaneously.

SpRef is the operation of storing a submatrix of a sparse matrix in another
sparse matrix (B = A(p, q)), and SpAsgn is the operation of assigning a sparse
matrix to a submatrix of another sparse matrix (B(p, q) = A). It is worth noting
that SpAsgn is the only key primitive that mutates its sparse matrix operand in
the general case.* Sparse matrix indexing can be quite powerful and complex if we
allow p and q to be arbitrary vectors of indices. Therefore, this chapter limits itself
to row wise (A(i, :)), column wise (A(:, i)), and element-wise (A(i, j)) indexing,
as they find more widespread use in graph algorithms. SpAsgn also requires the
matrix dimensions to match, e.g., if B(:, i) = A where $B \in \mathbb{S}^{M \times N}$, then $A \in \mathbb{S}^{M \times 1}$.

SpMV is the most widely used sparse matrix kernel since it is the workhorse of 
iterative linear equation solvers and eigenvalue computations. A sparse matrix can 
be multiplied by a dense vector either on the right (y = Ax) or on the left (y' = 
x'A). This chapter concentrates on the multiplication on the right. It is generally 
straightforward to reformulate algorithms that use multiplication on the left so 
that they use multiplication on the right. Some representative graph computations 
that use SpMV are page ranking (an eigenvalue computation), breadth-first search,
the Bellman-Ford shortest paths algorithm, and Prim's minimum spanning tree 
algorithm. 

SpAdd, C = A ⊕ B, computes the sum of two sparse matrices of dimensions
M × N. SpAdd is an abstraction that is not limited to any summation operator.
In general, any pointwise binary scalar operation between two sparse matrices falls
into this primitive. Examples include the MIN operator that returns the minimum of
its operands, logical AND, logical OR, ordinary addition, and subtraction.

SpGEMM computes the sparse product C = AB, where the input matrices
$A \in \mathbb{S}^{M \times K}$ and $B \in \mathbb{S}^{K \times N}$ are both sparse. It is a common operation for operating
on large graphs, used in graph contraction, peer pressure clustering, recursive
formulations of all-pairs shortest path algorithms, and breadth-first search from
multiple source vertices.

*While A = A ⊕ B or A = AB may also be considered as mutator operations, these are just
special cases when the output is the same as one of the inputs.

The computation for matrix multiplication can be organized in several ways, 
leading to different formulations. One common formulation is the inner product 
formulation, as shown in Algorithm 13.1. In this case, every element of the product
C(i, j) is computed as the dot product of row i in A and column j in B.
Another formulation of matrix multiplication is the outer product formulation
(Algorithm 13.2). The product is computed as a sum of K rank-one matrices. Each
rank-one matrix is computed as the outer product of column k of A and row k
of B.

Algorithm 13.1. Inner product matrix multiply.

Inner product formulation of matrix multiplication.

C : R^{S(M×N)} = InnerProduct-SpGEMM(A : R^{S(M×K)}, B : R^{S(K×N)})

    for i = 1 to M
        do for j = 1 to N
            do C(i, j) = A(i, :) · B(:, j)

Algorithm 13.2. Outer product matrix multiply.

Outer product formulation of matrix multiplication.

C : R^{S(M×N)} = OuterProduct-SpGEMM(A : R^{S(M×K)}, B : R^{S(K×N)})

    C = 0
    for k = 1 to K
        do C = C + A(:, k) · B(k, :)

SpGEMM can also be set up so that A and B are accessed by rows or columns, 
computing one row/column of the product C at a time. Algorithm 13.3 shows the 
column wise formulation where column j of C is computed as a linear combination 
of the columns of A as specified by the nonzeros in column j of B. Figure 13.2 shows 
the same concept graphically. Similarly, for the row wise formulation, each row i 
of C is computed as a linear combination of the rows of B specified by nonzeros in 
row i of A as shown in Algorithm 13.4. 

Algorithm 13.3. Column wise matrix multiplication.

Column wise formulation of matrix multiplication.

C : R^{S(M×N)} = ColumnWise-SpGEMM(A : R^{S(M×K)}, B : R^{S(K×N)})

    for j = 1 to N
        do for k where B(k, j) ≠ 0
            do C(:, j) = C(:, j) + A(:, k) · B(k, j)

Figure 13.2. Multiply sparse matrices column by column.

Multiplication of sparse matrices stored by columns. Columns of A are 
accumulated as specified by the nonzero entries in a column of B using 
a sparse accumulator (SPA). The contents of the SPA are stored in a 
column of C once all required columns are accumulated. 



Algorithm 13.4. Row wise matrix multiply.

Row wise formulation of matrix multiplication.

C : R^{S(M×N)} = RowWise-SpGEMM(A : R^{S(M×K)}, B : R^{S(K×N)})

    for i = 1 to M
        do for k where A(i, k) ≠ 0
            do C(i, :) = C(i, :) + A(i, k) · B(k, :)
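
The row wise formulation of Algorithm 13.4 can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the dictionary-of-rows layout, the function name `rowwise_spgemm`, and the use of ordinary addition and multiplication (rather than a general semiring) are all assumptions.

```python
def rowwise_spgemm(A_rows, B_rows):
    """Row wise sparse product C = AB.

    A_rows maps a row index i to a dict {k: A(i, k)} of that row's nonzeros.
    B_rows maps a row index k to a dict {j: B(k, j)} of that row's nonzeros.
    The result uses the same layout. Indices are 1-based as in the chapter.
    """
    C_rows = {}
    for i, a_row in A_rows.items():
        acc = {}                                  # accumulates row i of C
        for k, a_ik in a_row.items():             # each k with A(i, k) != 0
            for j, b_kj in B_rows.get(k, {}).items():
                # C(i, :) = C(i, :) + A(i, k) * B(k, :)
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C_rows[i] = acc
    return C_rows
```

Each row of C is built as a linear combination of the rows of B selected by the nonzeros of the corresponding row of A. The dictionary `acc` plays the role of a per-row accumulator; entries that cancel to zero are kept as explicit zeros in this sketch.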



13.3 Triples 

The simplest way to represent a sparse matrix is the triples (or coordinates) format. 
For each A(i, j) ≠ 0, the triple (i, j, A(i, j)) is stored in memory. Each entry in the
triple is usually stored in a different array and the whole matrix A is represented as
three arrays A.I (row indices), A.J (column indices), and A.V (numerical values), as
illustrated in Figure 13.3. These separate arrays are called "parallel arrays" by Duff
and Reid (see [Duff & Reid 1979]) but we reserve "parallel" for parallel algorithms.
Using 8-byte integers for row and column indices, storage cost is 8 + 8 + 8 = 24 
bytes per nonzero. 
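
As a concrete illustration of the three-array layout, the following Python sketch converts a small dense matrix into the arrays A.I, A.J, and A.V. The helper name `dense_to_triples` and the example matrix are hypothetical (this is not the matrix of Figure 13.3); indices are 1-based to match the chapter.

```python
def dense_to_triples(dense):
    """Return the triples arrays (I, J, V) of a dense matrix given as a list of rows.

    Only nonzero entries are stored. Row and column indices are 1-based.
    """
    I, J, V = [], [], []
    for i, row in enumerate(dense, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                I.append(i)   # row index
                J.append(j)   # column index
                V.append(value)
    return I, J, V


# Hypothetical 3 x 3 example with four nonzeros.
I, J, V = dense_to_triples([[0, 19, 0],
                            [11, 0, 0],
                            [0, 43, 27]])
```

Because of the loop order, this helper happens to emit the triples in row-major order. The unordered triples format permits any permutation of the same four triples, as long as the three arrays are permuted together.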

Modern programming languages offer easier ways of representing an array 
of tuples than using three separate arrays. An alternative implementation might 
choose to represent the set of triples as an array of records (or structs). Such an
implementation might improve cache performance, especially if the algorithm accesses
elements of the same index from different arrays. This type of cache optimization is
known as array merging [Kowarschik & Weiß 2002]. Some programming languages,
such as Python and Haskell, even support tuples as built-in types, and C++
includes tuples support in its standard library [Becker 2006]. In this section, we use
the established notation of three separate arrays (A.I, A.J, A.V) for simplicity, but
an implementer should keep in mind the other options.

Figure 13.3. Triples representation.

Matrix A (left) and its unordered triples representation (right).

This section evaluates the triples format under different levels of ordering. 
Unordered triples representation imposes no ordering constraints on the triples. 
Row ordered triples keep nonzeros ordered with respect to their row indices only. 
Nonzeros within the same row are stored arbitrarily, irrespective of their column 
indices. Finally, row-major order keeps nonzeros ordered lexicographically first 
according to their row indices and then according to their column indices to break 
ties. It is also possible to order with respect to columns instead of rows, but we 
analyze the row-based versions. Column-ordered and column-major ordered triples 
are similar. RAM and I/O complexities of key primitives for unordered and row 
ordered triples are listed in Tables 13.1 and 13.2. 

A theoretically attractive fourth option is to use hashing and store triples 
in a hash table. In the case of SpGEMM and SpAdd, dynamically managing the 
output matrix is computationally expensive since dynamic perfect hashing does
not yield high performance in practice [Mehlhorn & Näher 1999] and requires 35N
space [Dietzfelbinger et al. 1994]. A recently proposed dynamic hashing method
called Cuckoo hashing is promising. It supports queries in worst-case constant
time and updates in amortized expected constant time, while using only 2N space
[Pagh & Rodler 2004]. Experiments show that it is substantially faster than existing
hashing schemes on modern architectures like Pentium 4 and IBM Cell [Ross 2007].
Although hash-based schemes seem attractive, especially for SpAsgn and SpRef
primitives [Aspnäs et al. 2006], further research is required to test their efficiency
for sparse matrix storage. 



13.3.1 Unordered triples 

The administrative overhead of the triples representation is low, especially if the
triples are not sorted in any order. With unsorted triples, however, there is no
spatial locality* when accessing nonzeros of a given row or column. In the worst
case, all indexing operations might require a complete scan of the data structure.
Therefore, SpRef has O(nnz(A)) RAM complexity and O(scan(A)) I/O complexity.

Table 13.1. Unordered and row ordered RAM complexities.

Memory access complexities of key primitives on unordered and row
ordered triples.

            Unordered                     Row Ordered

  SpRef     O(nnz(A))                     A(i, j), A(i, :):  O(lg nnz(A) + nnz(A(i, :)))
                                          A(:, j):           O(nnz(A))

  SpAsgn    O(nnz(A)) + O(nnz(B))         O(nnz(A)) + O(nnz(B))

  SpMV      O(nnz(A))                     O(nnz(A))

  SpAdd     O(nnz(A) + nnz(B))            O(nnz(A) + nnz(B))

  SpGEMM    O(nnz(A) + nnz(B) + flops)    O(nnz(A) + flops)

Table 13.2. Unordered and row ordered I/O complexities.

Input/Output access complexities of key primitives on unordered and
row ordered triples.

            Unordered                     Row Ordered

  SpRef     O(scan(A))                    A(i, j), A(i, :):  O(lg nnz(A) + scan(A(i, :)))
                                          A(:, j):           O(scan(A))

  SpAsgn    O(scan(A) + scan(B))          O(scan(A) + scan(B))

  SpMV      O(nnz(A))                     O(nnz(A))

  SpAdd     O(nnz(A) + nnz(B))            O(nnz(A) + nnz(B))

  SpGEMM    O(nnz(A) + nnz(B) + flops)    O(min{nnz(A) + flops,
                                                scan(A) lg(nnz(B)) + flops})

*A procedure exploits spatial locality if data that are stored in nearby memory locations are
likely to be referenced close in time.

SpAsgn is no faster, even though insertions take only constant time per element.
In addition to accessing all the elements of the right-hand side matrix
A, SpAsgn also invalidates the existing nonzeros that need to be changed in the
left-hand side matrix B. Just finding those triples takes time proportional to the
number of nonzeros in B with unordered triples. Thus, RAM complexity of SpAsgn
is O(nnz(A) + nnz(B)) and its I/O complexity is O(scan(A) + scan(B)). A simple
implementation achieving these bounds performs a single scan of B, outputs only
the nonassigned triples (e.g., for B(:, k) = A, those are the triples (i, j, B(i, j))
where j ≠ k), and finally concatenates the nonzeros in A to the output.
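
The scan-and-concatenate implementation just described can be sketched as follows for the column assignment B(:, k) = A. The function name `spasgn_column_unordered` and the list-of-tuples layout are assumptions made for brevity; the chapter's notation uses three separate arrays.

```python
def spasgn_column_unordered(B, k, A):
    """Return the unordered triples of B after the assignment B(:, k) = A.

    B is a list of (i, j, value) triples in arbitrary order.
    A is a list of (i, value) pairs, the nonzeros of an M x 1 column.
    Indices are 1-based.
    """
    # Single scan of B: output only the triples that are not overwritten.
    result = [(i, j, v) for (i, j, v) in B if j != k]
    # Concatenate the nonzeros of A, placed in column k.
    result.extend((i, k, v) for (i, v) in A)
    return result
```

The work is one pass over B plus one pass over A, which matches the O(nnz(A) + nnz(B)) bound. No ordering has to be restored afterward because the format imposes none.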

SpMV has full spatial locality when accessing the elements of A because the
algorithm scans all the nonzeros of A in the exact order that they are stored.
Therefore, O(scan(A)) cache misses are taken for granted as compulsory misses.*
Although SpMV is optimal in the RAM model without any ordering constraints,
its cache performance suffers, as the algorithm cannot exploit any spatial locality
when accessing vectors x and y.

In considering the cache misses involved, for each triple (i, j, A(i, j)), a random
access to the jth component of x is required, and the result of the element-wise
multiplication A(i, j) · x(j) must be written to the random location y(i). Assumption 2
implies that the fast memory is not big enough to hold the dense arrays x and
y. Thus, we make up to two extra cache misses per flop. These indirect memory
accesses can be clearly seen in the Triples-SpMV code shown in Algorithm 13.5,
where the values of A.I(k) and A.J(k) may change in every iteration. Consequently,
I/O complexity of SpMV on unordered triples is

\[ \mathrm{nnz}(A)/L + 2\,\mathrm{nnz}(A) = O(\mathrm{nnz}(A)) \tag{13.1} \]

Algorithm 13.5. Triples matrix vector multiply.

Operation y = Ax using triples.

y : R^M = Triples-SpMV(A : R^{S(M×N)}, x : R^N)

    y = 0
    for k = 1 to nnz(A)
        do y(A.I(k)) = y(A.I(k)) + A.V(k) · x(A.J(k))
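
Algorithm 13.5 translates almost line by line into Python. In this sketch, the function name `triples_spmv` and the explicit row count argument M are assumptions; the stored indices stay 1-based as in the chapter, so they are shifted by one when addressing Python lists.

```python
def triples_spmv(I, J, V, x, M):
    """Compute y = Ax for a matrix stored as triples arrays (I, J, V).

    I and J hold 1-based row and column indices, V holds the values,
    x is a dense list of length N, and M is the number of rows of A.
    The triples may appear in any order.
    """
    y = [0.0] * M
    for k in range(len(V)):
        # Indirect accesses: y is indexed through I, x is indexed through J.
        y[I[k] - 1] += V[k] * x[J[k] - 1]
    return y
```

The loop reads I, J, and V sequentially, which is the scan(A) part of the cost, while the positions touched in x and y depend on the stored indices and can jump arbitrarily from one iteration to the next.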

The SpAdd algorithm needs to identify all pairs such that A(i, j) ≠ 0 and
B(i, j) ≠ 0, and to add their values to create a single entry in the resulting matrix.
This step can be accomplished by first sorting the nonzeros of the input matrices and
then performing a simultaneous scan of sorted nonzeros to sum matching triples.
Using linear time counting sort, SpAdd is fast in the RAM model with O(nnz(A) +
nnz(B)) complexity.

Counting sort, in its naive form, has poor cache utilization because the total
size of the counting array is likely to be bigger than the size of the fast memory.
Sorting the nonzeros of a sparse matrix translates into one cache miss per nonzero
in the worst case. Therefore, the complexity of SpAdd in the I/O model becomes
O(nnz(A) + nnz(B)). The number of cache misses can be decreased by using cache
optimal sorting algorithms (see [Aggarwal & Vitter 1988]), but such algorithms are
comparison based. They do O(n lg n) work as opposed to linear work. Rahman and
Raman [Rahman & Raman 2000] gave a counting sort algorithm that has better
cache utilization in practice than the naive algorithm, and still does linear work.
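
The sort-then-merge approach to SpAdd on unordered triples can be sketched as below. Several choices here are assumptions made for illustration: the function names, the list-of-tuples layout, the use of two stable counting sort passes (columns, then rows) to reach row-major order, ordinary addition as the pointwise operator, and inputs that contain no duplicate (i, j) positions.

```python
def counting_sort_triples(triples, key, size):
    """Stable counting sort of (i, j, v) triples by the 1-based index triples[key]."""
    counts = [0] * (size + 2)
    for t in triples:
        counts[t[key] + 1] += 1
    for idx in range(1, size + 2):
        counts[idx] += counts[idx - 1]      # counts[x] = first output slot for key x
    out = [None] * len(triples)
    for t in triples:
        out[counts[t[key]]] = t
        counts[t[key]] += 1
    return out


def row_major(triples, M, N):
    """Sort by column first, then stably by row, giving lexicographic (i, j) order."""
    return counting_sort_triples(counting_sort_triples(triples, 1, N), 0, M)


def spadd_unordered(A, B, M, N):
    """Return C = A + B as row-major triples; A and B are unordered triples lists."""
    a, b = row_major(A, M, N), row_major(B, M, N)
    C, p, q = [], 0, 0
    while p < len(a) and q < len(b):        # simultaneous scan of sorted nonzeros
        pos_a, pos_b = a[p][:2], b[q][:2]
        if pos_a == pos_b:                  # matching triples: sum the values
            C.append((pos_a[0], pos_a[1], a[p][2] + b[q][2]))
            p, q = p + 1, q + 1
        elif pos_a < pos_b:
            C.append(a[p])
            p += 1
        else:
            C.append(b[q])
            q += 1
    C.extend(a[p:])
    C.extend(b[q:])
    return C
```

Both sorting passes and the final merge do linear work, in line with the RAM bound above. The counting arrays of size M and N are the part that, under Assumption 2, does not fit in fast memory.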

*Assuming that no explicit data prefetching mechanism is used.

SpGEMM needs fast access to columns, rows, or a given particular element,
depending on the algorithm. One can also think of A as a table of i's and k's
and B as a table of k's and j's; then C is their join on k. This database analogy
[Kotlyar et al. 1997] may lead to alternative SpGEMM implementations based
on ideas from databases. An outer product formulation of SpGEMM on unordered
triples has three basic steps (a similar algorithm for general sparse tensor
multiplication is given by Bader and Kolda [Bader & Kolda 2007]):

1. For each k ∈ {1, ..., K}, identify the set of triples that belongs to the kth
column of A and the kth row of B. Formally, find A(:, k) and B(k, :).

2. For each k ∈ {1, ..., K}, compute the Cartesian product of the row indices of
A(:, k) and the column indices of B(k, :). Formally, compute the sets $C_k$ =
{A(:, k).I} × {B(k, :).J}.

3. Find the union of all Cartesian products, summing up duplicates during set
union: $C = \bigcup_{k \in \{1, \ldots, K\}} C_k$.

Step 1 of the algorithm can be efficiently implemented by sorting the triples
of A according to their column indices and the triples of B according to their row
indices. Computing the Cartesian products in step 2 takes time

\[ \sum_{k=1}^{K} \mathrm{nnz}(A(:,k)) \cdot \mathrm{nnz}(B(k,:)) = \mathrm{flops} \tag{13.2} \]

Finally, summing up duplicates can be done by lexicographically sorting the elements
from sets $C_k$, which has a total running time of

\[ O(\mathrm{sort}(\mathrm{nnz}(A)) + \mathrm{sort}(\mathrm{nnz}(B)) + \mathrm{flops} + \mathrm{sort}(\mathrm{flops})) \tag{13.3} \]

As long as the number of nonzeros is more than the dimensions of the matrices
(Assumption 1), it is advantageous to use a linear time sorting algorithm instead of a
comparison-based sort. Since a lexicographical sort is not required for finding A(:, k)
or B(k, :) in step 1, a single pass of linear time counting sort [Cormen et al. 2001]
suffices for each input matrix. However, two passes of linear time counting sort are
required in step 3 to produce a lexicographically sorted output. RAM complexity
of this implementation turns out to be

\[ \mathrm{nnz}(A) + \mathrm{nnz}(B) + 3 \cdot \mathrm{flops} = O(\mathrm{nnz}(A) + \mathrm{nnz}(B) + \mathrm{flops}) \tag{13.4} \]

However, due to the cache-inefficient nature of counting sort, this algorithm
makes O(nnz(A) + nnz(B) + flops) cache misses in the worst case.
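
The three steps of the outer product formulation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `outer_product_spgemm` and the list-of-tuples layout are ours, step 1 is done by bucketing on k (one counting-sort-like pass per input), and step 3 uses Python's built-in comparison sort as a stand-in for the two counting sort passes described above. Ordinary addition and multiplication replace the general semiring.

```python
def outer_product_spgemm(A, B, K):
    """Outer product SpGEMM on unordered triples.

    A is a list of (i, k, value) triples and B is a list of (k, j, value)
    triples, both in arbitrary order with 1-based indices. K is the shared
    inner dimension. Returns (C, flops), where C is in row-major order.
    """
    # Step 1: collect A(:, k) and B(k, :) for every k.
    cols_of_A = [[] for _ in range(K + 1)]
    rows_of_B = [[] for _ in range(K + 1)]
    for i, k, v in A:
        cols_of_A[k].append((i, v))
    for k, j, v in B:
        rows_of_B[k].append((j, v))

    # Step 2: Cartesian product of A(:, k) and B(k, :) for every k.
    products = []
    for k in range(1, K + 1):
        for i, a in cols_of_A[k]:
            for j, b in rows_of_B[k]:
                products.append((i, j, a * b))
    flops = len(products)

    # Step 3: sort lexicographically and sum duplicates during the scan.
    products.sort(key=lambda t: (t[0], t[1]))
    C = []
    for i, j, v in products:
        if C and C[-1][0] == i and C[-1][1] == j:
            C[-1] = (i, j, C[-1][2] + v)
        else:
            C.append((i, j, v))
    return C, flops
```

Note that the intermediate list `products` holds one entry per flop before duplicates are merged, so its length can exceed nnz(C).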

Another way to implement SpGEMM on unordered triples is to iterate through
the triples of A. For each (i, j, A(i, j)), we find B(j, :) and multiply A(i, j) with
each nonzero in B(j, :). The duplicate summation step is left intact. The time this
implementation takes is

\[ \mathrm{nnz}(A) \cdot \mathrm{nnz}(B) + 3 \cdot \mathrm{flops} = O(\mathrm{nnz}(A) \cdot \mathrm{nnz}(B)) \tag{13.5} \]

The term flops is dominated by the term nnz(A) · nnz(B) according to Theorem 13.1.
Therefore, the performance is worse than for the previous implementation that sorts
the input matrices first.

Theorem 13.1. For all matrices A and B, flops(AB) ≤ nnz(A) · nnz(B).

Proof. Let the vector of column counts of A be

\[ a = (a_1, a_2, \ldots, a_K) = (\mathrm{nnz}(A(:,1)), \mathrm{nnz}(A(:,2)), \ldots, \mathrm{nnz}(A(:,K))) \]

and the vector of row counts of B be

\[ b = (b_1, b_2, \ldots, b_K) = (\mathrm{nnz}(B(1,:)), \mathrm{nnz}(B(2,:)), \ldots, \mathrm{nnz}(B(K,:))). \]

Note that $\mathrm{flops} = a^{T} b = \sum_{i=j} a_i b_j$ and $\sum_{i \neq j} a_i b_j \geq 0$ as a and b are
nonnegative. Consequently,

\[ \mathrm{nnz}(A) \cdot \mathrm{nnz}(B) = \Bigl(\sum_{k=1}^{K} a_k\Bigr) \cdot \Bigl(\sum_{k=1}^{K} b_k\Bigr) = \sum_{i=j} a_i b_j + \sum_{i \neq j} a_i b_j \geq \sum_{i=j} a_i b_j = a^{T} b = \mathrm{flops}. \qquad \square \]



It is worth noting that both implementations of SpGEMM using unordered
triples have O(nnz(A) + nnz(B) + flops) space complexity, due to the intermediate
triples that are all present in the memory after step 2. Ideally, the space complexity
of SpGEMM should be O(nnz(A) + nnz(B) + nnz(C)), which is independent of
flops.

13.3.2 Row ordered triples 

The second option is to keep the triples sorted according to their rows or columns 
only. We analyze the row ordered version; column order is symmetric. This section 
is divided into three subsections. The first one is on indexing and SpMV. The 
second one is on a fundamental abstract data type that is used frequently in sparse 
matrix algorithms, namely the sparse accumulator (SPA). The SPA is used for 
implementing some of the SpAdd and SpGEMM algorithms throughout the rest of 
this chapter. Finally, the last subsection is on SpAdd and SpGEMM algorithms. 



Indexing and SpMV with row ordered triples 

Using row ordered triples, indexing still turns out to be inefficient. In practice, even
a fast row access cannot be accomplished since there is no efficient way of spotting
the beginning of the ith row without using an index.* Row wise referencing can
be done by performing binary search on the whole matrix to identify a nonzero
belonging to the referenced row, and then by scanning in both directions to find
the rest of the nonzeros belonging to that row. Therefore, SpRef for A(i, :) has
O(lg nnz(A) + nnz(A(i, :))) RAM complexity and O(lg nnz(A) + scan(A(i, :))) I/O
complexity. Element-wise referencing also has the same cost, in both models.
Column wise referencing, on the other hand, is as slow as it was with unordered triples,
requiring a complete scan of the triples.

*That is a drawback of the triples representation in general. The compressed sparse storage
formats described in Section 13.4 provide efficient indexing mechanisms for either rows or columns.
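
Row wise referencing on row ordered triples can be sketched with Python's standard `bisect` module. The function name `spref_row` and the parallel-list layout are assumptions. The sketch is also a slight variation on the description above: `bisect_left` lands on the first nonzero of the row rather than an arbitrary one, so only a forward scan is needed.

```python
from bisect import bisect_left


def spref_row(I, J, V, i):
    """Return the column indices and values of row i, i.e., A(i, :).

    I, J, V are the triples arrays of a row ordered matrix, so I is
    nondecreasing. Indices are 1-based. Columns within the row come
    back in their stored (arbitrary) order.
    """
    k = bisect_left(I, i)          # O(lg nnz(A)) binary search on row indices
    cols, vals = [], []
    while k < len(I) and I[k] == i:
        cols.append(J[k])          # O(nnz(A(i, :))) scan of the row
        vals.append(V[k])
        k += 1
    return cols, vals
```

A request for an empty row returns two empty lists after the binary search alone. Column wise referencing has no such shortcut in this format, since the column indices are not sorted.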

SpAsgn might incur excessive data movement, as the number of nonzeros in
the left-hand side matrix B might change during the operation. For a concrete
example, consider the operation B(i, :) = A where nnz(B(i, :)) ≠ nnz(A) before
the operation. Since the data structure needs to keep nonzeros with increasing row
indices, all triples with row indices bigger than i need to be shifted by distance
|nnz(A) − nnz(B(i, :))|.

SpAsgn has RAM complexity O(nnz(A) + nnz(B)) and I/O complexity
O(scan(A) + scan(B)), where B is the left-hand side matrix before the operation.
While implementations of row wise and element-wise assignment are straightforward,
column wise assignment (B(:, i) = A) seems harder as it reduces to a
restricted case of SpAdd. The restriction is that the right-hand side
$A \in \mathbb{S}^{M \times 1}$ has, at most, one nonzero
in a given row. Therefore, a similar scanning-based implementation suffices.

Row ordered triples format allows an SpMV implementation that makes,
at most, one extra cache miss per flop. The reason is that references to vector
y show good spatial locality: they are ordered with monotonically increasing
values of A.I(k), avoiding scattered memory referencing on vector y. However,
accesses to vector x are still irregular as the memory stride when accessing
x (|A.J(k + 1) − A.J(k)|) might be as big as the matrix dimension N. Memory
strides can be reduced by clustering the nonzeros in every row. More formally, this
clustering corresponds to reducing the bandwidth of the matrix, which is defined
as β(A) = max{|i − j| : A(i, j) ≠ 0}. Toledo experimentally studied different
methods of reordering the matrix to reduce its bandwidth, along with other
optimizations like blocking and prefetching, to improve the memory performance of
SpMV [Toledo 1997]. Overall, row ordering does not improve the asymptotic I/O
complexity of SpMV over unordered triples, although it cuts the cache misses by
nearly half. Its I/O complexity becomes

\[ \mathrm{nnz}(A)/L + N/L + \mathrm{nnz}(A) = O(\mathrm{nnz}(A)) \tag{13.6} \]


The sparse accumulator

Most operations that output a sparse matrix generate it one row at a time. The
current active row is stored temporarily on a special structure called the sparse
accumulator (SPA) [Gilbert et al. 1992] (or expanded real accumulator [Pissanetsky 1984]).
The SPA helps merge unordered lists in linear time.

There are different ways of implementing the SPA, as it is an abstract data
type, not a concrete data structure. In our SPA implementation, w is the dense
vector of values, b is the boolean dense vector that contains "occupied" flags, and
LS is the list that keeps an unordered list of indices, as Gilbert, Moler, and Schreiber
described [Gilbert et al. 1992].

The Scatter-SPA function, given in Algorithm 13.6, adds a scalar (value) to a
specific position (pos) of the SPA. Scattering is a constant time operation.
Gathering the SPA's nonzeros to the output matrix C takes O(nnz(SPA)) time. The
pseudocode for Gather-SPA is given in Algorithm 13.7. It is crucial to initialize
the SPA only once at the beginning, as this takes O(N) time. Resetting it later for the
next active row takes only O(nnz(SPA)) time by using LS to reach all the nonzero
elements and resetting only those indices of w and b.

Algorithm 13.6. Scatter SPA.

Scatters/accumulates the nonzeros in the SPA.

Scatter-SPA(SPA, value, pos)

    if (SPA.b(pos) = 0)
        then SPA.w(pos) ← value
             SPA.b(pos) ← 1
             Insert(SPA.LS, pos)
        else SPA.w(pos) ← SPA.w(pos) + value

Algorithm 13.7. Gather SPA.

Gathers/outputs the nonzeros in the SPA.

nzi = Gather-SPA(SPA, val, col, nzcur)

    cptr ← head(SPA.LS)
    nzi ← 0                                     ▷ number of nonzeros in the ith row of C
    while cptr ≠ nil
        do col(nzcur + nzi) ← element(cptr)            ▷ Set column index
           val(nzcur + nzi) ← SPA.w(element(cptr))     ▷ Set value
           nzi ← nzi + 1
           Advance(cptr)

The cost of resetting the SPA can be completely avoided by using the multiple 
switch technique (also called the phase counter technique) described by Gustavson 
(see [Gustavson 1976, Gustavson 1997]). Here, b becomes a dense switch vector 
of integers instead of a dense boolean vector. For computing each row, we use a 
different switch value. Every time a nonzero is introduced to position pos of the 
SPA, we set the switch to the current active row index (SPA.b(pos) — i). During 
the computation of subsequent rows j = {i + 1, . . . , M }, the switch value being less 
than the current active row index (SPA.b(pos) < j) means that the position pos of 
the SPA is "free." Therefore, the need to reset b for each row is avoided. 
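
The SPA operations and the multiple switch technique can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `SPA`, the method names, and the use of -1 as the initial switch value are assumptions made for the example, and indices are 0-based. A position counts as free whenever its switch differs from the active row, so starting a new row needs no pass over w or b.

```python
class SPA:
    """Sparse accumulator: dense values w, dense switch vector b, index list LS."""

    def __init__(self, n):
        self.w = [0.0] * n    # dense value vector
        self.b = [-1] * n     # switch vector; -1 means "never used" (assumption)
        self.ls = []          # unordered list of occupied positions
        self.row = -1         # index of the currently active row

    def start_row(self, i):
        # Constant-time "reset": stale switches no longer match the new row index.
        self.row = i
        self.ls = []

    def scatter(self, value, pos):
        if self.b[pos] != self.row:    # position is free for the active row
            self.w[pos] = value
            self.b[pos] = self.row
            self.ls.append(pos)
        else:                          # position already holds a partial sum
            self.w[pos] += value

    def gather(self):
        # Visits only occupied positions, so the cost is O(nnz(SPA)).
        return [(pos, self.w[pos]) for pos in self.ls]


spa = SPA(8)
spa.start_row(0)
spa.scatter(2.0, 5)
spa.scatter(1.5, 1)
spa.scatter(3.0, 5)       # accumulates into position 5
row0 = spa.gather()
spa.start_row(1)          # no reset of w or b is performed here
spa.scatter(4.0, 5)       # overwrites the stale value left by row 0
row1 = spa.gather()
```

The sketch relies on every row being given a distinct index, which holds when rows are processed in increasing order as in the text.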

SpAdd and SpGEMM with row ordered triples 

Using the SPA, we can implement SpAdd with O(nnz(A) + nnz(B)) RAM complexity. 
The full procedure is given in Algorithm 13.8. The I/O complexity of 
this SpAdd implementation is also O(nnz(A) + nnz(B)) because for each nonzero 
scanned from inputs, the algorithm checks and updates an arbitrary position of the 
SPA. From Assumption 2, these arbitrary accesses are likely to incur cache misses 
every time. 

Algorithm 13.8. Row ordered matrix add. 

Operation C = A ⊕ B using row ordered triples. 

C : R^{S(M×N)} = RowTriples-SpAdd(A : R^{S(M×N)}, B : R^{S(M×N)}) 

1  Set-SPA(SPA)                         ▷ Set w = 0, b = 0 and create empty list LS 
2  ka ← kb ← kc ← 1                     ▷ Initialize current indices to one 
3  for i ← 1 to M 
4    do while (ka ≤ nnz(A) and A.I(ka) = i) 
5         do SCATTER-SPA(SPA, A.V(ka), A.J(ka)) 
6            ka ← ka + 1 
7       while (kb ≤ nnz(B) and B.I(kb) = i) 
8         do SCATTER-SPA(SPA, B.V(kb), B.J(kb)) 
9            kb ← kb + 1 
10      nznew ← GATHER-SPA(SPA, C.V, C.J, kc) 
11      for j ← 0 to nznew - 1 
12        do C.I(kc + j) ← i            ▷ Set row index 
13      kc ← kc + nznew 
14      Reset-SPA(SPA)                  ▷ Reset w = 0, b = 0 and empty LS 



It is possible to implement SpGEMM by using the same outer product formulation 
described in Section 13.3.1, with a slightly better asymptotic RAM complexity 
of O(nnz(A) + flops), as the triples of B are already sorted according to their row 
indices. Instead, we describe a row wise implementation, similar to the CSR-based 
algorithm described in Section 13.4. Because of inefficient row wise indexing support 
of row ordered triples, however, the operation count is higher than the CSR 
version. A SPA of size N is used to accumulate the nonzero structure of the current 
active row of C. A direct scan of the nonzeros of A allows enumeration of nonzeros 
in A(i, :) for increasing values of i ∈ {1, ..., M}. Then, for each triple (i, k, A(i, k)) 
in the ith row of A, the matching triples (k, j, B(k, j)) of the kth row of B need 
to be found using the SpRef primitive. This way, the nonzeros in C(i, :) are accumulated. 
The whole procedure is given in Algorithm 13.9. Its RAM complexity 
is 

\[ \sum_{A(i,k) \neq 0} O(\mathrm{nnz}(B(k,:)) + \lg \mathrm{nnz}(B)) + \mathrm{flops} = O(\mathrm{nnz}(A) \lg \mathrm{nnz}(B) + \mathrm{flops}) \qquad (13.7) \] 

where the lg(nnz(B)) factor per each nonzero in A comes from the row wise SpRef 
operation in line 5. Its I/O complexity is 

\[ O(\mathrm{scan}(A) \lg \mathrm{nnz}(B) + \mathrm{flops}) \qquad (13.8) \] 

While the complexity of row wise implementation is asymptotically worse than 
the outer product implementation in the RAM model, it has the advantage of using 
only O(nnz(C)) space as opposed to the O(flops) space used by the outer product 
implementation. On the other hand, the I/O complexities of the outer product 
version and the row wise version are not directly comparable. Which one is faster 
depends on the cache line size and the number of nonzeros in B. 

Algorithm 13.9. Row ordered matrix multiply. 

Operation C = AB using row ordered triples. 

C : R^{S(M×N)} = RowTriples-SpGEMM(A : R^{S(M×K)}, B : R^{S(K×N)}) 

1  Set-SPA(SPA)                         ▷ Set w = 0, b = 0 and create empty list LS 
2  ka ← kc ← 1                          ▷ Initialize current indices to one 
3  for i ← 1 to M 
4    do while (ka ≤ nnz(A) and A.I(ka) = i) 
5         do BR ← B(A.J(ka), :)         ▷ Using SpRef 
6            for kb ← 1 to nnz(BR) 
7              do value ← A.NUM(ka) · BR.NUM(kb) 
8                 SCATTER-SPA(SPA, value, BR.J(kb)) 
9            ka ← ka + 1 
10      nznew ← GATHER-SPA(SPA, C.V, C.J, kc) 
11      for j ← 0 to nznew - 1 
12        do C.I(kc + j) ← i            ▷ Set row index 
13      kc ← kc + nznew 
14      Reset-SPA(SPA)                  ▷ Reset w = 0, b = 0 and empty LS 



13.3.3 Row-major ordered triples 

We now consider the third option of storing triples in lexicographic order, either 
in column-major or row-major order. Once again, we focus on the row-oriented 
scheme in this section. RAM and I/O complexities of key primitives for row-major 
ordered triples and CSR are listed in Tables 13.3 and 13.4. 

In order to reference a whole row, binary search on the whole matrix, followed 
by a scan on both directions, is used, as with row ordered triples. As the nonzeros 
in a row are ordered by column indices, it seems there should be a faster way to 
access a single element than the method used on row ordered triples. A faster way 



Table 13.3. Row-major ordered RAM complexities. 

RAM complexities of key primitives on row-major ordered triples and CSR. 

Primitive        Row-Major Ordered Triples             CSR 
---------------  ------------------------------------  ----------------------- 
SpRef, A(i,j)    O(lg nnz(A) + lg nnz(A(i,:)))         O(lg nnz(A(i,:))) 
SpRef, A(i,:)    O(lg nnz(A) + nnz(A(i,:)))            O(nnz(A(i,:))) 
SpRef, A(:,j)    O(nnz(A))                             O(nnz(A)) 
SpAsgn           O(nnz(A)) + O(nnz(B))                 O(nnz(A) + nnz(B)) 
SpMV             O(nnz(A))                             O(nnz(A)) 
SpAdd            O(nnz(A) + nnz(B))                    O(nnz(A) + nnz(B)) 
SpGEMM           O(nnz(A) + flops)                     O(nnz(A) + flops) 

Table 13.4. Row-major ordered I/O complexities. 

I/O complexities of key primitives on row-major ordered triples and CSR. 

Primitive        Row-Major Ordered Triples             CSR 
---------------  ------------------------------------  ----------------------- 
SpRef, A(i,j)    O(lg nnz(A) + search(A(i,:)))         O(search(A(i,:))) 
SpRef, A(i,:)    O(lg nnz(A) + scan(A(i,:)))           O(scan(A(i,:))) 
SpRef, A(:,j)    O(scan(A))                            O(scan(A)) 
SpAsgn           O(scan(A) + scan(B))                  O(scan(A) + scan(B)) 
SpMV             O(nnz(A))                             O(nnz(A)) 
SpAdd            O(scan(A) + scan(B))                  O(scan(A) + scan(B)) 
SpGEMM           O(min{nnz(A) + flops,                 O(scan(A) + flops) 
                 scan(A) lg(nnz(B)) + flops}) 



indeed exists, but ordinary binary search would not do it because the beginning 
and the end of the ith row are not known in advance. The algorithm has three steps: 

1. Spot a triple (i, j, A(i, j)) that belongs to the ith row by doing binary search 
on the whole matrix. 

2. From that triple, perform an unbounded binary search [Manber 1989] in both 
directions. In an unbounded search, the step length is doubled at each iteration. 
The search terminates in a given direction when it hits a triple that 
does not belong to the ith row. Those two triples (one from each direction) 
become the boundary triples. 

3. Perform ordinary binary search within the exclusive range defined by the 
boundary triples. 

The total number of operations is O(lg nnz(A) + lg nnz(A(i, :))). An example is 
given in Figure 13.4, where A(12, 16) is indexed. 
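
The three steps can be sketched in Python as shown here. This is an illustrative sketch rather than the authors' code: the function name `find_element` and the helper `boundary` are hypothetical, array positions are 0-based, and the sample arrays are the A.I, A.J, and A.V values shown in Figure 13.4.

```python
def find_element(I, J, V, i, j):
    """Return A(i, j) from row-major ordered triples, or None if it is not stored."""
    n = len(I)

    # Step 1: ordinary binary search for any triple lying in row i.
    lo, hi, p = 0, n - 1, -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if I[mid] == i:
            p = mid
            break
        if I[mid] < i:
            lo = mid + 1
        else:
            hi = mid - 1
    if p < 0:
        return None                      # row i holds no nonzeros

    # Step 2: unbounded search; the step doubles until we leave row i.
    def boundary(direction):
        step = 1
        q = p + direction * step
        while 0 <= q < n and I[q] == i:
            step *= 2
            q = p + direction * step
        return max(-1, min(n, q))        # clamp to the sentinels -1 and n

    left, right = boundary(-1), boundary(+1)

    # Step 3: binary search on (row, column) keys strictly between the boundaries.
    lo, hi = left + 1, right - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = (I[mid], J[mid])
        if key == (i, j):
            return V[mid]
        if key < (i, j):
            lo = mid + 1
        else:
            hi = mid - 1
    return None


A_I = [10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, 13, 13]
A_J = [18, 5, 12, 2, 3, 4, 7, 10, 11, 13, 16, 18, 2, 9, 11]
A_V = [31, 90, 76, 57, 89, 19, 96, 65, 44, 34, 28, 28, 21, 11, 54]
value = find_element(A_I, A_J, A_V, 12, 16)
```

Because the step doubles, the range handed to step 3 is at most a constant factor longer than the row itself (unless it is clipped by the ends of the array), which is where the lg nnz(A(i, :)) term comes from.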

While unbounded binary search is the preferred method in the RAM model, 
simple scanning might be faster in the I/O model. Searching for an element in an 
ordered set of N elements can be achieved with Θ(log_L N) cost in the I/O model, using 
B-trees (see [Bayer & McCreight 1972]). However, using an ordinary array, search 
incurs lg N cache misses. This may or may not be less than scan(N). Therefore, 
we define the cost of searching within an ordered row as follows: 

\[ \mathrm{search}(A(i,:)) = \min\{\lg \mathrm{nnz}(A(i,:)),\ \mathrm{scan}(A(i,:))\} \qquad (13.9) \] 

For column wise referencing as well as for SpAsgn operations, row-major or- 
dered triples format does not provide any improvement over row ordered triples. 

In SpMV, the only array that does not show excellent spatial locality is x, 
since A.I, A.J, A.V, and y are accessed with mostly consecutive, increasing index 

A.I   10  11  11  12  12  12  12  12  12  12  12  12  13  13  13 
A.J   18   5  12   2   3   4   7  10  11  13  16  18   2   9  11 
A.V   31  90  76  57  89  19  96  65  44  34  28  28  21  11  54 

Figure 13.4. Indexing row-major triples. 

Element-wise indexing of A(12, 16) on row-major ordered triples. 



values. Accesses to x are also with increasing indices, which is an improvement over 
row ordered triples. However, memory strides when accessing x can still be high, 
depending on the number of nonzeros in each row and the bandwidth of the matrix. 
In the worst case, each access to x might incur a cache miss. 

Bender et al. came up with cache-efficient algorithms for SpMV, using the 
column-major layout, that have an optimal number of cache misses [Bender et al. 2007]. 
From a high-level view, their method first generates all the intermediate triples of 
y, possibly with repeating indices. Then, the algorithm sorts those intermediate 
triples with respect to their row indices, performing additions on the triples with 
same row index as they occur. I/O optimality of their SpMV algorithm relies on the 
existence of an I/O optimal sorting algorithm. Their complexity measure assumes 
a fixed k number of nonzeros per column, leading to I/O complexity of 

\[ O\left(\mathrm{scan}(A) \log_{Z/L} \frac{N}{\max\{Z, k\}}\right) \qquad (13.10) \] 

SpAdd is now more efficient even without using any auxiliary data structure. 
A scanning-based array-merging algorithm is sufficient as long as we do not forget 
to sum duplicates while merging. Such an implementation has O(nnz(A) + nnz(B)) 
RAM complexity and O(scan(A) + scan(B)) I/O complexity.* 
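
A scanning-based merge of this kind might look like the following Python sketch. The function name `spadd_row_major` and the representation of each matrix as a list of (row, column, value) tuples are assumptions made for illustration.

```python
def spadd_row_major(a, b):
    """Merge two lists of (row, col, value) triples that are in row-major order."""
    c, ka, kb = [], 0, 0
    while ka < len(a) and kb < len(b):
        ra, ca, va = a[ka]
        rb, cb, vb = b[kb]
        if (ra, ca) == (rb, cb):
            c.append((ra, ca, va + vb))   # sum duplicates while merging
            ka += 1
            kb += 1
        elif (ra, ca) < (rb, cb):
            c.append(a[ka])
            ka += 1
        else:
            c.append(b[kb])
            kb += 1
    c.extend(a[ka:])                      # at most one of these tails is nonempty
    c.extend(b[kb:])
    return c


A = [(1, 2, 1.0), (1, 5, 2.0), (3, 1, 4.0)]
B = [(1, 2, 3.0), (2, 2, 5.0), (3, 1, -4.0), (3, 4, 6.0)]
C = spadd_row_major(A, B)
```

Each input is scanned once and no auxiliary structure such as the SPA is needed. The sketch keeps entries that cancel to zero, as at position (3, 1) in this example; dropping them would take one extra comparison per merged pair.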

Row-major ordered triples allow outer product and row wise SpGEMM implementations 
at least as efficient as those on row ordered triples. Indeed, some finer improvements 
are possible by exploiting the more specialized structure. In the case of 
row wise SpGEMM, a technique called finger search [Brodal 2005] can be used to 
improve the RAM complexity. While enumerating all triples (i, k, A(i, k)) ∈ A(i, :), they 
are naturally sorted with increasing k values. Therefore, accesses to B(k, :) are also 
with increasing k values. Instead of restarting the binary search from the beginning 
of B, one can use fingers and only search the yet unexplored subsequence. Note 

*These bounds are optimal only if nnz(A) = Θ(nnz(B)); see [Brown & Tarjan 1979]. 
that finger search uses the unbounded binary search as a subroutine when searching 
the unexplored subsequence. Row wise SpGEMM using finger search has a RAM 
complexity of 



which is asymptotically faster than the O(nnz(A) lg(nnz(B)) + flops) cost of the 
same algorithm on row ordered triples. 

Outer product SpGEMM can be modified to use only O(nnz(C)) space during 
execution by using multiway merging [Buluç & Gilbert 2008b]. However, this comes 
at the price of an extra lg k factor in the asymptotic RAM complexity, where k is 
the number of indices i for which A(:, i) ≠ 0 and B(i, :) ≠ 0. 

Although both of these refined algorithms are asymptotically slower than the 
naive outer product method, they might be faster in practice because of the cache 
effects and difference in constants in the asymptotic complexities. Further research 
is required in algorithm engineering of SpGEMM to find the best performing algo- 
rithm in real life. 



The most widely used storage schemes for sparse matrices are compressed sparse 
column (CSC) and compressed sparse row (CSR). For example, Matlab uses CSC 
format to store its sparse matrices [Gilbert et al. 1992]. Both are dense collections of 
sparse arrays. We examine CSR, which was introduced by Gustavson [Gustavson 1972] 
under the name of sparse row wise representation; CSC is symmetric. 

CSR can be seen as a concatenation of sparse row arrays. On the other hand, 
it is also very close to row ordered triples with an auxiliary index of size Θ(N). 
In this section, we assume that nonzeros within each sparse row array are ordered 
with increasing column indices. This is not a general requirement though. Tim Davis's 
CSparse package [Davis et al. 2006], for example, does not impose any ordering 
within the sparse arrays. 

13.4.1 CSR and adjacency lists 

In principle, CSR is almost identical to the adjacency list representation of a directed 
graph [Tarjan 1972]. In practice, however, it has much less overhead and 
much better cache efficiency. Instead of storing an array of linked lists as in the 
adjacency list representation, CSR is composed of three arrays that store whole 
rows contiguously. The first array (IR) of size M + 1 stores the row pointers as 
explicit integer values, the second array (JC) of size nnz stores the column indices, 
and the last array (NUM) of size nnz stores the actual numerical values. Observe 
that column indices stored in the JC array indeed come from concatenating the edge 
indices of the adjacency lists. Following the sparse matrix/graph duality, it is also 
meaningful to call the first array the vertex array and the second array the edge 

Figure 13.5. CSR format. 

Adjacency list (left) and CSR (right) representations of matrix A. 



array. The vertex array holds the offsets to the edge array, meaning that the nonzeros 
in the ith row are stored from NUM(IR(i)) to NUM(IR(i + 1) - 1), and their 
respective positions within that row are stored from JC(IR(i)) to JC(IR(i + 1) - 1). 
Also note that IR(i) = IR(i + 1) means there are no nonzeros in the ith row. 
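
As a concrete illustration of the three arrays, the Python sketch below builds IR, JC, and NUM from a list of triples and enumerates one row. The helper names `triples_to_csr` and `csr_row` are hypothetical, and the sketch uses 0-based indices and offsets, unlike the 1-based convention of the text.

```python
def triples_to_csr(m, triples):
    """Build 0-based CSR arrays (IR, JC, NUM) for an m-row matrix."""
    triples = sorted(triples)            # row-major order
    ir = [0] * (m + 1)
    for i, _, _ in triples:
        ir[i + 1] += 1                   # count the nonzeros of each row
    for i in range(m):
        ir[i + 1] += ir[i]               # prefix sums turn counts into offsets
    jc = [j for _, j, _ in triples]
    num = [v for _, _, v in triples]
    return ir, jc, num


def csr_row(ir, jc, num, i):
    """Enumerate the (column, value) pairs of row i in O(nnz(A(i, :))) time."""
    start, end = ir[i], ir[i + 1]
    return list(zip(jc[start:end], num[start:end]))


IR, JC, NUM = triples_to_csr(4, [(2, 0, 7.0), (0, 3, 2.0), (0, 1, 5.0), (3, 3, 1.0)])
first_row = csr_row(IR, JC, NUM, 0)
empty_row = csr_row(IR, JC, NUM, 1)      # IR[1] == IR[2], so row 1 has no nonzeros
```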

Figure 13.5 shows the adjacency list and CSR representations of matrix A. 
While the arrows in the adjacency-based representation are actual pointers to mem- 
ory locations, the arrows in CSR are not. Any data type that supports unsigned 
integers fulfills the desired purpose of holding the offsets to the edge array. 

The efficiency advantage of the CSR data structure compared with the adjacency 
list can be explained by the memory architecture of modern computers. In 
order to access all the nonzeros in a given row i, which is equivalent to traversing all 
the outgoing edges of a given vertex v_i, CSR makes at most ⌈nnz(A(i, :))/L⌉ cache 
misses. A similar access to the adjacency list representation incurs nnz(A(i, :)) 
cache misses in the worst case, worsening as the memory becomes more and more 
fragmented. In an experiment published in 1998, an array-based representation 
was found to be 10 times faster to traverse than a linked-list-based representation 
[Black et al. 1998]. This performance gap is due to the high cost of pointer 
chasing that happens frequently in linked data structures. The efficiency of CSR 
comes at a price though: introducing new nonzero elements or deleting a nonzero 
element is computationally inefficient and best avoided [Gilbert et al. 1992]. Therefore, 
CSR is best suited for representing static graphs. All the key primitives but 
SpAsgn work on static graphs. 



13.4.2 CSR on key primitives 

Contrary to triples storage formats, CSR allows constant-time random access to 
any row of the matrix. Its ability to enumerate all the elements in the ith row 
with O(nnz(A(i, :))) RAM complexity and O(scan(A(i, :))) I/O complexity makes it 
an excellent data structure for row wise SpRef. Element-wise referencing takes, at 
most, O(lg nnz(A(i, :))) time in the RAM model as well as the I/O model, using a 
binary search. Considering column wise referencing, however, CSR does not provide 
any improvement over the triples format. 

On the other hand, even row wise SpAsgn operations are inefficient if the 
number of elements in the assigned row changes. In that general case, O(nnz(B)) 
elements might need to be moved. This is also true for column wise and element-wise 
SpAsgn as long as not just existing nonzeros are reassigned to new values. 

The code in Algorithm 13.10 shows how to perform SpMV when matrix A is 
represented in CSR format. This code and SpMV with row-major ordered triples 
have similar performance characteristics except for a few subtleties. When some 
rows of A are all zeros, those rows are effectively skipped in row-major ordered 
triples, but still need to be examined in CSR. On the other hand, when M ≪ nnz, 
CSR has a clear advantage since it needs to examine only one index (A.JC(k)) per 
inner loop iteration while row-major ordered triples needs to examine two (A.I(k) 
and A.J(k)). Thus, CSR may have up to a factor of two difference in the number of 
cache misses. CSR also has some advantages over CSC when the SpMV primitive 
is considered (especially in the case of y = y + Ax), as experimentally shown by 
Vuduc [Vuduc 2003]. 

Algorithm 13.10. CSR matrix vector multiply. 

Operation y = Ax using CSR. 

y : R^M = CSR-SpMV(A : R^{S(M)×N}, x : R^N) 

1  y = 0 
2  for i = 1 to M 
3    do for k = A.IR(i) to A.IR(i + 1) - 1 
4         do y(i) = y(i) + A.NUM(k) · x(A.JC(k)) 
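
A direct Python transcription of Algorithm 13.10 is sketched below, assuming 0-based arrays and plain lists for IR, JC, NUM, and x; the function name `csr_spmv` is chosen for the example.

```python
def csr_spmv(ir, jc, num, x):
    """Compute y = A x for A stored as 0-based CSR arrays (IR, JC, NUM)."""
    m = len(ir) - 1
    y = [0.0] * m
    for i in range(m):
        # Only one index array (JC) is read per inner iteration.
        for k in range(ir[i], ir[i + 1]):
            y[i] += num[k] * x[jc[k]]
    return y


# 4 x 4 example: row 0 = {(1, 5), (3, 2)}, row 1 empty, row 2 = {(0, 7)}, row 3 = {(3, 1)}
IR = [0, 2, 2, 3, 4]
JC = [1, 3, 0, 3]
NUM = [5.0, 2.0, 7.0, 1.0]
y = csr_spmv(IR, JC, NUM, [1.0, 2.0, 3.0, 4.0])
```

The empty row 1 is still visited by the outer loop, which is the subtlety noted above for matrices with many all-zero rows.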

Blocked versions of CSR and CSC try to take advantage of clustered nonzeros 
in the sparse matrix. While blocked CSR (BCSR) achieves superior performance 
for SpMV on matrices resulting from finite element meshes [Vuduc 2003], mostly 
by using loop unrolling and register blocking, it is of little use when the matrix 
itself does not have its nonzeros clustered. Pinar and Heath [Pinar & Heath 1999] 
proposed a reordering mechanism to cluster those nonzeros to get dense subblocks. 
However, it is not clear whether such mechanisms are successful for highly irregular 
matrices from sparse real-world graphs. 

Except for the additional bookkeeping required for getting the row pointers 
right, SpAdd can be implemented in the same way as is done with row-major 
ordered triples. Luckily, the extra bookkeeping of row pointers does not affect the 
asymptotic complexities. 

One subtlety overlooked in the SpAdd implementations throughout this chap- 
ter is management of the memory required by the resulting matrix C. We implicitly 
assumed that the data structure holding C has enough space to accommodate all 
of its elements. Repeated doubling of memory whenever necessary is one way of 
addressing this issue. Another conservative way is to reserve nnz(A) + nnz(B) 
space for C at the beginning of the procedure and shrink any unused portion after 
the computation, right before the procedure returns. 

The efficiency of accessing and enumerating rows in CSR makes the row 
wise SpGEMM formulation, described in Algorithm 13.4, the preferred matrix 
multiplication formulation. An efficient implementation of the row wise SpGEMM 
using CSR was first given by Gustavson [Gustavson 1978]. It had a RAM complexity 
of 

\[ O(M + N + \mathrm{nnz}(A) + \mathrm{flops}) = O(\mathrm{nnz}(A) + \mathrm{flops}) \qquad (13.12) \] 

where the equality follows from Assumption 1. Recent column wise implementations 
with similar RAM complexities are provided by Davis in his CSparse software 
[Davis et al. 2006] and by Matlab [Gilbert et al. 1992]. The algorithm, presented 
in Algorithm 13.11, uses the SPA described in Section 13.3.2. Once again, 
the multiple switch technique can be used to avoid the cost of resetting the SPA for 
every iteration of the outermost loop. As in the case of SpAdd, generally the space 
required to store C cannot be determined quickly. Repeated doubling or more sophisticated 
methods such as Cohen's algorithm [Cohen 1998] may be used. Cohen's 
algorithm is a randomized iterative algorithm that does Θ(1) SpMV operations over 
a semiring to estimate the row and column counts. It can be efficiently implemented 
even on unordered triples. 

Algorithm 13.11. CSR matrix multiply. 

Operation C = AB using CSR. 

C : R^{S(M)×N} = CSR-SpGEMM(A : R^{S(M)×K}, B : R^{S(K)×N}) 

1  Set-SPA(SPA)                         ▷ Set w = 0, b = 0 and create empty list LS 
2  C.IR(1) ← 0 
3  for i ← 1 to M 
4    do for k ← A.IR(i) to A.IR(i + 1) - 1 
5         do for j ← B.IR(A.JC(k)) to B.IR(A.JC(k) + 1) - 1 
6              do 
7                 value ← A.NUM(k) · B.NUM(j) 
8                 SCATTER-SPA(SPA, value, B.JC(j)) 
9       nznew ← GATHER-SPA(SPA, C.NUM, C.JC, C.IR(i)) 
10      C.IR(i + 1) ← C.IR(i) + nznew 
11      Reset-SPA(SPA)                  ▷ Reset w = 0, b = 0 and empty LS 
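
The row wise CSR multiplication can be sketched in Python as follows. This is an illustrative reading of Algorithm 13.11 under several assumptions: indices are 0-based, the SPA is inlined as two dense lists, the multiple switch technique replaces the explicit reset, and the output arrays are Python lists, whose append operation grows the storage geometrically and thus stands in for repeated doubling. The function name `csr_spgemm` is hypothetical.

```python
def csr_spgemm(a, b, n):
    """C = A B, where a and b are (IR, JC, NUM) tuples and n is the column count of B."""
    a_ir, a_jc, a_num = a
    b_ir, b_jc, b_num = b
    m = len(a_ir) - 1
    w = [0.0] * n             # SPA values
    sw = [-1] * n             # SPA switch vector; -1 means "never used"
    c_ir, c_jc, c_num = [0], [], []
    for i in range(m):
        ls = []               # occupied SPA positions for row i
        for k in range(a_ir[i], a_ir[i + 1]):
            for j in range(b_ir[a_jc[k]], b_ir[a_jc[k] + 1]):
                col = b_jc[j]
                value = a_num[k] * b_num[j]
                if sw[col] != i:          # scatter into a free position
                    sw[col] = i
                    w[col] = value
                    ls.append(col)
                else:                     # accumulate into an occupied position
                    w[col] += value
        for col in ls:                    # gather; columns are in insertion order
            c_jc.append(col)
            c_num.append(w[col])
        c_ir.append(len(c_jc))
    return c_ir, c_jc, c_num


# A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 6]]
A = ([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 3], [1, 0, 1], [4.0, 5.0, 6.0])
C_IR, C_JC, C_NUM = csr_spgemm(A, B, 2)
```

Note that the column indices within each output row come out in the order they were first scattered, so a sort per row would be needed if sorted rows are required.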



The row wise SpGEMM implementation does O(scan(A) + flops) cache misses 
in the worst case. Due to the size of the SPA and Assumption 2, the algorithm makes 
a cache miss for every flop. As long as no cache interference occurs between the 
nonzeros of A and the nonzeros of C(i, :), only scan(A) additional cache misses are 
made instead of nnz(A). 

13.5 Case study: Star-P 

This section summarizes how sparse matrices are represented in Star-P. It also 
includes how the key primitives are implemented in this real-world software solution. 

13.5.1 Sparse matrices in Star-P 

The current Star-P implementation includes dsparse (distributed sparse) matrices, 
which are distributed across processors by blocks of rows. This layout makes 
the CSR data structure a logical choice to store the sparse matrix slice on each 
processor. 

The design of sparse matrix algorithms in Star-P follows the same design 
principles as in Matlab [Gilbert et al. 1992]. 

1. Storage required for a sparse matrix should be O(nnz), proportional to the 
number of nonzero elements. 

2. Running time for a sparse matrix algorithm should be O(flops). It should be 
proportional to the number of floating-point operations required to obtain the 
result. 

The CSR data structure satisfies the requirement for storage as long as M < 
nnz. The second principle is difficult to achieve exactly in practice. Typically, most 
implementations achieve running time close to O(flops) for commonly used sparse 
matrix operations. For example, accessing a single element of a sparse matrix should 
be a constant-time operation. With a CSR data structure, it typically takes time 
proportional to the logarithm of the length of the row to access a single element. 
Similarly, insertion of single elements into a CSR data structure generates extensive 
data movement. Such operations are efficiently performed with the sparse/find 
routines, which work with triples rather than individual elements. 

Sparse matrix-dense vector multiplication (SpMV) 

The CSR data structure used in Star-P is efficient for multiplying a sparse matrix 
by a dense vector: y = Ax. It is efficient for communication and it shows good 
cache behavior for the sequential part of the computation. Our choice of the CSR 
data structure was heavily influenced by our desire to have good SpMV performance 
since it forms the core computational kernel for many iterative methods. 

The matrix A and vector x are distributed across processors by rows. The 
submatrix of A on each processor will need some subset of x depending upon its 
sparsity structure. When SpMV is invoked for the first time on a dsparse matrix 
A, Star-P computes a communication schedule for A and caches it. When later 
matvecs are performed using the same A, this communication schedule does not 
need to be recomputed, thus saving some computing and communication overhead 
at the cost of extra space required to save the schedule. We experimented with 
overlapping computation and communication in SpMV. It turns out in many cases 
that this is less efficient than simply performing the communication first, followed 
by the computation. As computer architectures evolve, this decision may need to 
be revisited. 

When multiplying from the left, y = x'A, the communication is not as effi- 
cient. Instead of communicating the required subpieces of the source vector, each 
processor computes its own destination vector. All partial destination vectors are 
then summed up into the final destination vector. The communication required is 
always O(N). The choice of the CSR data structure, while making the communication 
more efficient when multiplying from the right, makes it more difficult to 
multiply on the left. 

Sparse matrix-sparse matrix multiplication 

Star-P stores its matrices in CSR form. Clearly, computing inner products (Algorithm 13.1) 
is inefficient, since columns of B cannot be efficiently accessed without 
searching. Similarly, in the case of computing outer products (Algorithm 13.2), 
columns of A have to be extracted. The process of accumulating successive rank-one 
updates is also inefficient, as the structure of the result changes with each successive 
update. Therefore, Star-P uses the row wise formulation of matrix multiplication 
described in Section 13.2, in which the computation is set up so that only rows of 
A and B are accessed, producing a row of C at a time. 

The performance of sparse matrix multiplication in parallel depends upon the 
nonzero structures of A and B. A well-tuned implementation may use a polyalgo- 
rithm. Such a polyalgorithm may use different communication schemes for different 
matrices. For example, it may be efficient to broadcast the local part of a matrix to 
all processors, but in other cases, it may be efficient to send only the required rows. 
On large clusters, it may be efficient to interleave communication and computation. 
On shared memory architectures, however, most of the time is spent in accumulat- 
ing updates, rather than in communication. In such cases, it may be more efficient 
to schedule the communication before the computation. 



13.6 Conclusions 

In this chapter, we gave a brief survey on sparse matrix infrastructure for doing 
graph algorithms. We focused on implementation and analysis of key primitives on 
various sparse matrix data structures. We tried to complement the existing litera- 
ture in two directions. First, we analyzed sparse matrix indexing and assignment 
operations. Second, we gave I/O complexity bounds for all operations. Taking I/O 
complexities into account is key to achieving high performance in modern architectures 
with multiple levels of cache. 

New Ideas in Sparse Matrix 
Matrix Multiplication 



Aydın Buluç* and John Gilbert† 



Abstract 

Generalized sparse matrix matrix multiplication is a key primitive for 
many high performance graph algorithms as well as some linear solvers 
such as multigrid. We present the first parallel algorithms that achieve 
increasing speedups for an unbounded number of processors. Our al- 
gorithms are based on the two-dimensional (2D) block distribution of 
sparse matrices where serial sections use a novel hypersparse kernel for 
scalability. 

14.1 Introduction 

Development and implementation of large-scale parallel graph algorithms pose numerous 
challenges in terms of scalability and productivity [Lumsdaine et al. 2007, 
Yoo et al. 2005]. Linear algebra formulations of many graph algorithms already 
exist in the literature; see [Aho et al. 1974, Maggs & Plotkin 1988, Tarjan 1981]. 
By exploiting the duality between matrices and graphs, linear algebraic formulations 
aim to apply the existing knowledge on parallel matrix algorithms to parallel 
graph algorithms. One of the key linear algebraic primitives for graph algorithms 
is computing the product of two sparse matrices (SpGEMM) over a semiring. It 
serves as a building block for many algorithms, including graph contraction 
[Gilbert et al. 2008], breadth-first search from multiple source vertices, peer pressure 
clustering [Shah 2007], recursive formulations of all-pairs shortest paths algorithms 
[D'Alberto & Nicolau 2007], matching algorithms [Rabin & Vazirani 1989], 
and cycle detection [Yuster & Zwick 2004], as well as for some other applications 
such as multigrid interpolation/restriction [Briggs et al. 2000] and parsing context-free 
languages [Penn 2006]. 

Most large graphs in applications, such as the WWW graph, finite element 
meshes, planar graphs, and trees, are sparse. In this chapter, we consider a graph to 
be sparse if nnz = O(N), where nnz is the number of edges and N is the number 
of vertices. Dense matrix multiplication algorithms are inefficient for SpGEMM 
since they require O(N^2) space and the current fastest dense matrix multiplication 
algorithm runs in O(N^2.38) time; see [Coppersmith & Winograd 1987, Seidel 1995]. 
Furthermore, fast dense matrix multiplication algorithms operate on a ring instead 
of a semiring, making them unsuitable for many algorithms on general graphs. For 
example, it is possible to embed the semiring into the ring of integers for the all-pairs 
shortest paths problem on unweighted and undirected graphs [Seidel 1995], 
but the same embedding does not work for weighted or directed graphs. 

Let A ∈ S^{M×N} be a sparse rectangular matrix of elements from an arbitrary 
semiring S. We use nnz(A) to denote the number of nonzero elements in A. When 
the matrix is clear from context, we drop the parentheses and simply use nnz. For 
sparse matrix indexing, we use the convenient Matlab colon notation, where A(:, i) 
denotes the ith column, A(i, :) denotes the ith row, and A(i, j) denotes the element 
at the (i, j)th position of matrix A. For one-dimensional arrays, a(i) denotes the 
ith component of the array. Sometimes, we abbreviate and use nnz(j) to denote 
the number of nonzero elements in the jth column of the matrix in context. Array 
indices are 1-based throughout this chapter. We use flops(A op B), pronounced 
"flops," to denote the number of nonzero arithmetic operations required by the 
operation A op B. Again, when the operation and the operands are clear from 
context, we simply use flops. 
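
For the product AB, one way to make flops concrete is to count the scalar multiplications A(i, k) · B(k, j) in which both factors are nonzero, which equals the sum over k of nnz(A(:, k)) · nnz(B(k, :)). The Python sketch below computes that count from lists of (row, column, value) triples; the function name `spgemm_flops` is an assumption, and counting multiplications only (rather than multiply-add pairs) is a convention chosen for the example.

```python
def spgemm_flops(a_triples, b_triples):
    """Count products A(i,k) * B(k,j) in which both factors are stored nonzeros."""
    col_count_a = {}
    for _, k, _ in a_triples:
        col_count_a[k] = col_count_a.get(k, 0) + 1      # nnz(A(:, k))
    row_count_b = {}
    for k, _, _ in b_triples:
        row_count_b[k] = row_count_b.get(k, 0) + 1      # nnz(B(k, :))
    return sum(c * row_count_b.get(k, 0) for k, c in col_count_a.items())


# A = [[1, 0], [2, 3]] and B = [[0, 4], [5, 6]], written with 1-based indices
A = [(1, 1, 1.0), (2, 1, 2.0), (2, 2, 3.0)]
B = [(1, 2, 4.0), (2, 1, 5.0), (2, 2, 6.0)]
flops = spgemm_flops(A, B)
```

Here column 1 of A has two nonzeros and row 1 of B has one, while column 2 of A has one nonzero and row 2 of B has two, giving 2 · 1 + 1 · 2 = 4.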

The most widely used data structures for sparse matrices are the Compressed 
Sparse Columns (CSC) and Compressed Sparse Rows (CSR). The previous chapter 
gives concise descriptions of common SpGEMM algorithms operating both 
on CSC/CSR and triples. The SpGEMM problem was recently reconsidered in 
[Yuster & Zwick 2005] over a ring, where the authors used a fast dense matrix 
multiplication such as arithmetic progression [Coppersmith & Winograd 1987] as a 
subroutine. Their algorithm uses O(nnz^0.7 N^1.2 + N^(2+o(1))) arithmetic operations, 
which is theoretically close to optimal only if we assume that the number of nonzeros 
in the resulting matrix C is Θ(N^2). This assumption rarely holds in reality. Instead, 
we provide a work-sensitive analysis by expressing the computational complexity of 
our SpGEMM algorithms in terms of flops. 

Practical sparse algorithms have been proposed by different researchers over 
the years (see, e.g., [Park et al. 1992, Sulatycke & Ghose 1998]) using various data 
structures. Although they achieve reasonable performance on some classes of matrices, 
none of these algorithms outperforms the classical sparse matrix matrix 
multiplication algorithm for general sparse matrices, which was first described by 
Gustavson [Gustavson 1978] and was used in Matlab [Gilbert et al. 1992] and 
CSparse [Davis et al. 2006]. The classical algorithm runs in O(flops + nnz + N) 
time. 

In Section 14.2, we present a novel algorithm for sequential SpGEMM that is 
geared toward computing the product of two hypersparse matrices. A matrix is hypersparse 
if the ratio of nonzeros to its dimension is asymptotically 0. The algorithm 
is used as the sequential building block of our parallel 2D algorithms described in 
Section 14.3. Our Hypersparse_GEMM algorithm uses a new O(nnz) data structure, 
called DCSC for doubly compressed sparse columns, which is explained in 
Section 14.2.2. The Hypersparse_GEMM is based on the outer product formulation 
and has time complexity O(nzc(A) + nzr(B) + flops · lg ni), where nzc(A) 
is the number of columns of A that contain at least one nonzero, nzr(B) is the 
number of rows of B that contain at least one nonzero, and ni is the number of 
indices i for which A(:, i) ≠ 0 and B(i, :) ≠ 0. The overall space complexity of our 
algorithm is only O(nnz(A) + nnz(B) + nnz(C)). Notice that the time complexity 
of our algorithm does not depend on N, and the space complexity does not depend 
on flops. 

Section 14.3 presents parallel algorithms for SpGEMM. We propose novel algorithms 
based on 2D block decomposition of data in addition to giving the complete 
description of an existing 1D algorithm. To the best of our knowledge, parallel algorithms 
using a 2D block decomposition have not earlier been developed for sparse 
matrix matrix multiplication. 

Irony, Toledo, and Tiskin [Irony et al. 2004] proved that 2D dense matrix
multiplication algorithms are optimal with respect to the communication volume,
making 2D sparse algorithms likely to be more scalable than their one-dimensional
(1D) counterparts. In Section 14.4, we show that this intuition is indeed correct by
providing a theoretical analysis of the parallel performance of 1D and 2D algorithms.

In Section 14.5, we model the speedup of parallel SpGEMM algorithms using
realistic simulations and projections. Our results show that existing 1D algorithms
are not scalable to thousands of processors. By contrast, 2D algorithms have the
potential for scaling up indefinitely, albeit with decreasing parallel efficiency, which
is defined as the ratio of speedup to the number of processors.

14.2 Sequential sparse matrix multiply 

We first analyze different formulations of sparse matrix matrix multiplication by
using the layered-graph model in Section 14.2.1. This graph theoretical explanation
gives insights on the suitability of the outer product formulation for multiplying
hypersparse matrices (Hypersparse_GEMM). Section 14.2.2 defines hypersparse
matrices and Section 14.2.3 introduces the DCSC data structure that is suitable to
store hypersparse matrices. We present our Hypersparse_GEMM algorithm in
Section 14.2.4.

14.2.1 Layered graphs for different formulations of SpGEMM 

Matrix multiplication can be organized in many different ways. The inner product
formulation that usually serves as the definition of matrix multiplication is well
known. Given two matrices A ∈ R^(M×K) and B ∈ R^(K×N), each element in the
product C ∈ R^(M×N) is computed by the following formula:

\[
C(i, j) = \sum_{k=1}^{K} A(i, k)\, B(k, j). \tag{14.1}
\]

This formulation is rarely useful for multiplying sparse matrices since it
requires Ω(M · N) operations regardless of the sparsity of the operands.

We represent the multiplication of two matrices A and B as a three-layered
graph, following Cohen [Cohen 1998]. The layers have M, K, and N vertices, in
that order. The first layer of vertices (U) represents the rows of A, and the third
layer of vertices (V) represents the columns of B. The second layer of vertices (W)
represents the dimension shared between matrices. Every nonzero A(i, l) ≠ 0 in the
ith row of A forms an edge (u_i, w_l) between the first and second layers, and every
nonzero B(l, j) ≠ 0 in the jth column of B forms an edge (w_l, v_j) between the
second and third layers.

We perform different operations on the layered graph depending on the way
we formulate the multiplication. In all cases, though, the goal is to find pairs of
vertices (u_i, v_j) sharing an adjacent vertex w_k ∈ W, and if any pair shares multiple
adjacent vertices, to merge their contributions.

Using inner products, we analyze each pair (u_i, v_j) to find the set of vertices
W_ij ⊆ W = {w_1, w_2, ..., w_K} that are connected to both u_i and v_j in the graph
shown in Figure 14.1. The algorithm then accumulates contributions a_il · b_lj for all
w_l ∈ W_ij. The result becomes the value of C(i, j) in the output. In general, this
inner product subgraph is sparse, and a contribution from w_l happens only when
both edges a_il and b_lj exist. However, this sparsity is not exploited using inner
products as it needs to examine each (u_i, v_j) pair, even when the set W_ij is empty.

In the outer product formulation, the product is written as the summation of
K rank-one matrices:

\[
C = \sum_{k=1}^{K} A(:, k)\, B(k, :). \tag{14.2}
\]

A different subgraph results from this formulation as it is the set of vertices
W that represent the shared dimension that plays the central role. Note that
the edges are traversed in the outward direction from a node w_i ∈ W, as shown in
Figure 14.2. For sufficiently sparse matrices, this formulation may run faster because
this traversal is performed only for the vertices in W (size K) instead of the inner
product traversal that had to be performed for every pair (size MN). The problem
with outer product traversal is that it is hard to accumulate the intermediate results
into the final matrix.

Figure 14.1. Graph representation of the inner product A(i, :) · B(:, j).

Figure 14.2. Graph representation of the outer product A(:, i) · B(i, :).

A row-by-row formulation of matrix multiplication performs a traversal
starting from each of the vertices in U towards V, as shown in Figure 14.3 for u_i.
The traversals are independent of one another because they generate different rows
of C. Finally, a column-by-column formulation creates an isomorphic traversal, in the
reverse direction (from V to U).

Figure 14.3. Graph representation of the sparse row times matrix
product A(i, :) · B.
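The outer product and row-by-row traversals described above can be compared on a toy example. The following is only an illustrative sketch: the dictionary-of-coordinates storage and the function names `outer_product_spgemm` and `rowwise_spgemm` are assumptions made for this example and are not the data structures used later in the chapter.

```python
from collections import defaultdict


def outer_product_spgemm(A, B):
    """C = A*B as a sum of rank-one products A(:, k) * B(k, :).

    A and B are dicts mapping (row, col) -> value (hypothetical storage).
    """
    cols_of_A = defaultdict(list)  # k -> [(i, A[i, k])]
    rows_of_B = defaultdict(list)  # k -> [(j, B[k, j])]
    for (i, k), a in A.items():
        cols_of_A[k].append((i, a))
    for (k, j), b in B.items():
        rows_of_B[k].append((j, b))
    C = defaultdict(float)
    # Only indices k with a nonzero column in A AND a nonzero row in B matter.
    for k in cols_of_A.keys() & rows_of_B.keys():
        for i, a in cols_of_A[k]:
            for j, b in rows_of_B[k]:
                C[(i, j)] += a * b
    return dict(C)


def rowwise_spgemm(A, B):
    """C = A*B one row at a time: C(i, :) = sum over k of A(i, k) * B(k, :)."""
    rows_of_A = defaultdict(list)
    rows_of_B = defaultdict(list)
    for (i, k), a in A.items():
        rows_of_A[i].append((k, a))
    for (k, j), b in B.items():
        rows_of_B[k].append((j, b))
    C = {}
    for i, row in rows_of_A.items():
        acc = defaultdict(float)  # accumulator for the current row of C
        for k, a in row:
            for j, b in rows_of_B.get(k, ()):
                acc[j] += a * b
        for j, v in acc.items():
            C[(i, j)] = v
    return C


A = {(0, 0): 1.0, (0, 2): 2.0, (1, 2): 3.0}
B = {(0, 1): 4.0, (2, 1): 5.0, (2, 0): 6.0}
C_outer = outer_product_spgemm(A, B)
C_row = rowwise_spgemm(A, B)
```

Both traversals touch only existing edges of the layered graph, so neither pays the Ω(M · N) cost of the inner product formulation. The sketch sidesteps the accumulation problem of the outer product by using a hash table for C.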

14.2.2 Hypersparse matrices 

Recall that a matrix is hypersparse if nnz < N. Although CSR/CSC is a fairly
efficient storage scheme for general sparse matrices having nnz = Ω(N), it is
asymptotically suboptimal for hypersparse matrices. Hypersparse matrices are fairly
rare in numerical linear algebra (indeed, a nonsingular square matrix must have
nnz ≥ N), but they occur frequently in computations on graphs, particularly in
parallel.

Our main motivation for hypersparse matrices comes from parallel processing.
Hypersparse matrices arise after the 2D block data decomposition of ordinary sparse
matrices for parallel processing. Consider a sparse matrix with c nonzero elements in
each column. After the 2D decomposition of the matrix, each processor locally owns
a submatrix with dimensions (N/√p) × (N/√p). Storing each of those submatrices
in CSC format takes O(N√p + nnz) space in total, whereas the amount of space needed
to store the whole matrix in CSC format on a single processor is only O(N + nnz). As
the number of processors increases, the N√p term dominates the nnz term.
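The storage argument can be checked with a few lines of arithmetic. This is a minimal sketch under stated assumptions: one word per column pointer, one word per row index, one word per numerical value, p a perfect square, and N divisible by √p. The function name `csc_storage_words` is hypothetical.

```python
import math


def csc_storage_words(N, nnz, p):
    """Aggregate CSC storage over a sqrt(p) x sqrt(p) grid of blocks.

    Returns (pointer_words, nonzero_words). Assumes p is a perfect square
    and that sqrt(p) divides N.
    """
    q = math.isqrt(p)
    if q * q != p or N % q != 0:
        raise ValueError("p must be a perfect square whose root divides N")
    # Every one of the p blocks keeps its own array of N/sqrt(p) + 1 pointers.
    pointer_words = p * (N // q + 1)
    # Row indices and values are not replicated: 2 words per nonzero.
    nonzero_words = 2 * nnz
    return pointer_words, nonzero_words


N, c = 2**20, 8
single = csc_storage_words(N, c * N, 1)
grid = csc_storage_words(N, c * N, 1024)
```

With these inputs the pointer arrays grow from about N words on one processor to about N√p words on 1024 processors, which already exceeds the space taken by the nonzeros themselves.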

Figure 14.4 shows that the average number of nonzeros in a single column of a
submatrix, nnz(j), goes to zero as p increases. Storing a graph using CSC is similar
to using adjacency lists. The column pointers array represents the vertices, and the
row indices array represents their adjacencies. In that sense, CSC is a vertex-based
data structure, making it suitable for 1D (vertex) partitioning of the graph. On
the other hand, 2D partitioning is based on edges. Therefore, using CSC with 2D
distributed data is forcing a vertex-based representation on edge distributed data.
The result is unnecessary replication of column pointers (vertices) on each processor
along the processor column.

The inefficiency of CSC leads to a more fundamental problem: any algorithm
that uses CSC and scans all the columns is not scalable for hypersparse matrices.
Even without any communication at all, such an algorithm cannot scale for
N√p ≥ max{flops, nnz}. SpMV and SpGEMM are algorithms that scan column indices.
For these operations, any data structure that depends on the matrix dimension
(such as CSR or CSC) is asymptotically too wasteful for submatrices.

Figure 14.4. 2D sparse matrix decomposition. (The matrix is divided into
√p × √p blocks; each column of the full matrix has nnz(j) = c nonzeros.)



14.2.3 DCSC data structure

We use a new data structure for our sequential hypersparse matrix matrix multi- 
plication. This structure, called DCSC for doubly compressed sparse columns, has 
the following properties. 

1. It uses O(nnz) storage.

2. It lets the hypersparse algorithm scale with increasing sparsity. 

3. It supports fast access to columns of the matrix. 

For an example, consider the 9 × 9 matrix with 4 nonzeros whose triples
representation is given in Figure 14.6. Figure 14.5 shows its CSC storage, which
includes repetitions and redundancies in the column pointers array (JC). Our new
data structure compresses the JC array to avoid repetitions, giving the CP (column
pointers) array of DCSC as shown in Figure 14.7. DCSC is essentially a sparse
array of sparse columns, whereas CSC is a dense array of sparse columns.

After removing repetitions, CP[i] no longer refers to the ith column. A
new JC array, which is parallel to CP, gives us the column numbers. Although our
Hypersparse_GEMM algorithm does not need column indexing, DCSC supports
fast column indexing for completeness. Whenever column indexing is needed, we
construct an AUX array that contains pointers to nonzero columns (columns that
have at least one nonzero element). Each entry in AUX refers to a ⌈n/nzc⌉-sized
chunk of columns, pointing to the first nonzero column in that chunk (there might
be none). The storage requirement of DCSC is O(nnz) since |NUM| = |IR| = nnz,
|JC| = nzc, |CP| = nzc + 1, and |AUX| ≈ nzc.

JC  = 1 3 3 3 3 3 3 4 5 5
IR  = 6 8 4 2
NUM = 0.1 0.2 0.3 0.4

Figure 14.5. Matrix A in CSC format.

Figure 14.6. Matrix A in triples format.

Figure 14.7. Matrix A in DCSC format.


In our implementation, the AUX array is a temporary work array that is
constructed on demand, only when an operation requires repetitive use of it. This
keeps the storage and copying costs low. The time to construct AUX is only O(nzc),
which is subsumed by the cost of multiplication.
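To make the layout concrete, the sketch below compresses the CSC arrays of Figure 14.5 into DCSC and builds one possible AUX array. It is an illustration under assumptions, not the authors' implementation: indices are converted to 0-based, the helper names `csc_to_dcsc`, `build_aux`, and `find_column` are hypothetical, AUX is given one extra sentinel entry, and AUX is built here with binary searches for brevity rather than with the O(nzc) scan mentioned in the text.

```python
import math
from bisect import bisect_left


def csc_to_dcsc(col_ptr, ir, num):
    """Drop repeated column pointers; return (JC, CP, IR, NUM)."""
    jc, cp = [], [col_ptr[0]]
    for j in range(len(col_ptr) - 1):
        if col_ptr[j + 1] > col_ptr[j]:  # column j has at least one nonzero
            jc.append(j)
            cp.append(col_ptr[j + 1])
    return jc, cp, ir, num


def build_aux(jc, n):
    """AUX[k] = position in JC of the first nonzero column in chunk k or later.

    Assumes at least one nonzero column (nzc > 0).
    """
    nzc = len(jc)
    chunk = math.ceil(n / nzc)
    nchunks = math.ceil(n / chunk)
    aux = [bisect_left(jc, k * chunk) for k in range(nchunks)] + [nzc]
    return aux, chunk


def find_column(jc, aux, chunk, j):
    """Return the position of column j in JC, or None if the column is empty."""
    k = j // chunk
    lo, hi = aux[k], aux[k + 1]
    pos = bisect_left(jc, j, lo, hi)
    return pos if pos < hi and jc[pos] == j else None


# Figure 14.5, shifted to 0-based indices.
col_ptr = [0, 2, 2, 2, 2, 2, 2, 3, 4, 4]
ir = [5, 7, 3, 1]
num = [0.1, 0.2, 0.3, 0.4]

jc, cp, ir, num = csc_to_dcsc(col_ptr, ir, num)
aux, chunk = build_aux(jc, n=9)
```

The ten column pointers of CSC shrink to three column numbers and four pointers, and a column lookup only searches the few JC entries that fall in its chunk.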

14.2.4 A sequential algorithm to multiply hypersparse matrices

The sequential hypersparse algorithm (Hypersparse_GEMM) is based on outer
product multiplication. Therefore, it requires fast access to rows of matrix B.
This could be accomplished by having each input matrix represented in DCSC
and also in DCSR (doubly compressed sparse rows), which is the same as the
transpose in DCSC. This method, which we described in an early version of this
work [Buluç & Gilbert 2008b], doubles the storage but does not change the
asymptotic space and time complexities. Here, we describe a more practical version
where B is transposed as a preprocessing step, at a cost of trans(B). The actual cost of
transposition is either O(N + nnz(B)) or O(nnz(B) lg nnz(B)), depending on the
implementation.

The idea behind the Hypersparse_GEMM algorithm is to use the outer
product formulation of matrix multiplication efficiently. The first observation about
DCSC is that the JC array is already sorted. Therefore, A.JC is the sorted indices
of the columns that contain at least one nonzero, and similarly B^T.JC is the sorted
indices of the rows that contain at least one nonzero. In this formulation, the ith
column of A and the ith row of B are multiplied to form a rank-one matrix. The
naive algorithm does the same procedure for all values of i and gets n different
rank-one matrices, adding them to the resulting matrix C as they become available. Our
algorithm has a preprocessing step that finds the intersection Isect = A.JC ∩ B^T.JC,
which is the set of indices that participate nontrivially in the outer product.

Figure 14.8. Cartesian product and the multiway merging analogy.

Figure 14.9. Nonzero structures of operands A and B.

The preprocessing takes O(nzc(A) + nzr(B)) time as |A.JC| = nzc(A) and
|B^T.JC| = nzr(B). The next phase of our algorithm performs |Isect| Cartesian
products, each of which generates a fictitious list of size nnz(A(:, i)) · nnz(B(i, :)).
The lists can be generated sorted because all the elements within a given column
are sorted according to their row indices (i.e., IR(JC(i)) ... IR(JC(i + 1)) is a sorted
range). The algorithm merges those sorted lists, summing up the intermediate
entries having the same (row_id, col_id) index pair, to form the resulting matrix
C. Therefore, the second phase of Hypersparse_GEMM is similar to multiway
merging [Knuth 1997]. The only difference is that we never explicitly construct the
lists; we compute their elements one by one on demand.

Figure 14.8 shows the setup for the matrices from Figure 14.9. Since A.JC =
{1, 2, 3, 4, 6} and B^T.JC = {1, 3, 4, 5, 6}, Isect = {1, 3, 4, 6} for this product. The
algorithm does not touch the shaded elements since they do not contribute to the
output.

The merge uses a priority queue (represented as a heap) of size ni, which is
the size of Isect, the number of indices i for which A(:, i) ≠ 0 and B(i, :) ≠ 0. The
value in a heap entry is its NUM value and the key is a pair of indices in
column-major order. The idea is to repeatedly extract the entry with minimum key
from the heap and insert another element from the list that the extracted element
originally came from. If there are multiple elements in the lists with the same key,
then their values are added on the fly. If we were to explicitly create ni lists instead
of doing the computation on the fly, we would get the lists shown in the right side of
Figure 14.8, which are sorted from bottom to top. For further details of multiway
merging, consult Knuth [Knuth 1997].
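The intersection step and the heap-based multiway merge can be sketched in Python with the standard `heapq` module. This is a simplified illustration under assumptions, not the authors' code: both operands are passed as 0-based (JC, CP, IR, NUM) tuples, B is assumed to be already transposed, the function name `hypersparse_gemm` is hypothetical, and the output is returned as a list of ((col, row), value) pairs instead of being packed back into DCSC.

```python
import heapq


def hypersparse_gemm(A, BT):
    """Multiply A by B, where BT holds B transposed, both in DCSC form.

    Returns [((col, row), value), ...] sorted in column-major order.
    """
    a_jc, a_cp, a_ir, a_num = A
    b_jc, b_cp, b_ir, b_num = BT

    # Phase 1: Isect = A.JC intersect BT.JC by merging two sorted arrays.
    lists, pa, pb = [], 0, 0
    while pa < len(a_jc) and pb < len(b_jc):
        if a_jc[pa] < b_jc[pb]:
            pa += 1
        elif a_jc[pa] > b_jc[pb]:
            pb += 1
        else:
            # One fictitious list per common index: A(:, i) x B(i, :).
            lists.append((a_cp[pa], a_cp[pa + 1], b_cp[pb], b_cp[pb + 1]))
            pa += 1
            pb += 1

    def entry(t, ai, bi):
        # Element of list t; key is (column of C, row of C).
        return ((b_ir[bi], a_ir[ai]), t, ai, bi, a_num[ai] * b_num[bi])

    # Phase 2: multiway merge; the heap holds one element per list.
    heap = [entry(t, l[0], l[2]) for t, l in enumerate(lists)]
    heapq.heapify(heap)
    out = []
    while heap:
        key, t, ai, bi, val = heapq.heappop(heap)
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + val)  # same (col, row): add on the fly
        else:
            out.append((key, val))
        a_start, a_end, _, b_end = lists[t]
        ai += 1                                # advance within column of A ...
        if ai == a_end:
            ai, bi = a_start, bi + 1           # ... then move to next column of C
        if bi < b_end:
            heapq.heappush(heap, entry(t, ai, bi))
    return out


# A has nonzero columns 0 and 2; B has nonzero rows 0, 1, and 2.
A = ([0, 2], [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
BT = ([0, 1, 2], [0, 1, 2, 4], [1, 0, 0, 1], [4.0, 7.0, 5.0, 6.0])
C = hypersparse_gemm(A, BT)
```

Row 1 of B is never touched because column 1 of A is empty, which mirrors the shaded elements of Figure 14.8. Each heap operation costs O(lg ni), and one element is extracted per scalar multiplication.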

The time complexity of this phase is O(flops · lg ni), and the space complexity
is O(nnz(C) + ni). The output is a stack of NUM values in column-major order.
The nnz(C) term in the space complexity comes from the output, and the flops
term in the time complexity comes from the observation that

\[
\sum_{i \in \mathrm{Isect}} nnz(A(:, i)) \cdot nnz(B(i, :)) = flops.
\]

The final phase of the algorithm constructs the DCSC structure from this
column-major ordered stack. This requires O(nnz(C)) time and space.

The overall time complexity of our algorithm is O(nzc(A) + nzr(B) + flops ·
lg ni), plus the preprocessing time to transpose matrix B. Note that nnz(C) does
not appear in this bound since nnz(C) ≤ flops. We opt to keep the cost of
transposition separate because our parallel 2D block SpGEMM will amortize this
transposition of each block over √p uses of that block. Therefore, the cost of
transposition will be negligible in practice. The space complexity is
O(nnz(A) + nnz(B) + nnz(C)). The time complexity does not depend on N, and the
space complexity does not depend on flops.

Algorithm 14.1 gives the pseudocode for the whole algorithm. It uses two
subprocedures: CartMult-Insert generates the next element from the ith fictitious
list and inserts it into the heap PQ, and Increment-List increments the pointers of
the ith fictitious list or deletes the list from the intersection set if it is empty.

To justify the extra logarithmic factor in the flops term, we briefly analyze the
complexity of each submatrix multiplication in the parallel 2D block SpGEMM.
Our parallel 2D block SpGEMM performs p√p submatrix multiplications since
each submatrix of the output is computed using C_ij = Σ_{k=1}^{√p} A_ik B_kj.
Therefore, with increasing number of processors and under perfect load balance, flops
scales with 1/(p√p), nnz scales with 1/p, and N scales with 1/√p. Figure 14.10
shows the trends of these three complexity measures as p increases. The graph
shows that the N term becomes the bottleneck after around 50 processors and flops
becomes the lower-order term. In contrast to the classical algorithm, our
Hypersparse_GEMM algorithm becomes independent of N, by putting the burden on
the flops instead.
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Algorithm 14.1. Hypersparse matrix multiply. 

Pseudocode for hypersparse matrix matrix multiplication algorithm.

C : R^S(M×N) = HyperSparseGEMM(A : R^S(M×K), B^T : R^S(N×K))
    isect ← Intersection(A.JC, B^T.JC)
    for j ← 1 to |isect|
        do CartMult-Insert(A, B^T, PQ, isect, j)
           Increment-List(isect, j)
    while IsNotFinished(isect)
        do (key, value) ← Extract-Min(PQ)
           (product, i) ← UnPair(value)
           if key ≠ Top(Q)
               then Enqueue(Q, key, product)
               else UpdateTop(Q, product)
           if IsNotEmpty(isect(i))
               then CartMult-Insert(A, B^T, PQ, isect, i)
                    Increment-List(isect, i)
    Construct-DCSC(Q)




Figure 14.10. Trends of different complexity measures for submatrix
multiplications as p increases. The inputs are randomly permuted R-MAT
matrices (scale 15 with an average of 8 nonzeros per column) that are successively
divided into (N/√p) × (N/√p) submatrices. The counts are averaged over all
submatrix multiplications.
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14.3 Parallel algorithms for sparse GEMM 

This section describes parallel algorithms for multiplying two sparse matrices in 
parallel on p processors, which we call PSpGEMM. The design of our algorithms 
is motivated by distributed memory systems, but we expect them to perform well 
in shared memory too, as they avoid hot spots and load imbalances by ensuring 
proper work distribution among processors. Like most message passing algorithms, 
they can be implemented in the partitioned global address space (PGAS) model as 
well. 

14.3.1 1D decomposition

We assume the data is distributed to processors in block rows, where each processor
receives M/p consecutive rows. We write A_i = A(ip : (i + 1)p − 1, :) to denote the
block row owned by the ith processor. To simplify the algorithm description, we
use A_ij to denote A_i(:, jp : (j + 1)p − 1), the jth block column of A_i, although
block rows are not physically partitioned:

\[
A = \begin{pmatrix} A_1 \\ \vdots \\ A_p \end{pmatrix}
  = \begin{pmatrix} A_{11} & \cdots & A_{1p} \\ \vdots & \ddots & \vdots \\ A_{p1} & \cdots & A_{pp} \end{pmatrix},
\qquad
B = \begin{pmatrix} B_1 \\ \vdots \\ B_p \end{pmatrix}. \tag{14.3}
\]

For each processor P(i), the computation is

\[
C_i = C_i + A_i B = C_i + \sum_{j=1}^{p} A_{ij} B_j.
\]

14.3.2 2D decomposition 

Our 2D parallel algorithms, Sparse Cannon and Sparse SUMMA, use the
hypersparse algorithm, which has complexity O(nzc(A) + nzr(B) + flops · lg ni), as
shown in Section 14.2.4, for multiplying submatrices. Processors are logically organized
on a square √p × √p mesh, indexed by their row and column indices so that the
(i, j)th processor is denoted by P(i, j). Matrices are assigned to processors
according to a 2D block decomposition. Each node gets a submatrix of dimensions
(N/√p) × (N/√p) in its local memory. For example, A is partitioned as shown
below and A_ij is assigned to processor P(i, j):

\[
A = \begin{pmatrix}
A_{11} & \cdots & A_{1\sqrt{p}} \\
\vdots & \ddots & \vdots \\
A_{\sqrt{p}1} & \cdots & A_{\sqrt{p}\sqrt{p}}
\end{pmatrix}. \tag{14.4}
\]

For each processor P(i, j), the computation is

\[
C_{ij} = \sum_{k=1}^{\sqrt{p}} A_{ik} B_{kj}.
\]

14.3.3 Sparse 1D algorithm

The rowwise SpGEMM forms one row of C at a time, and each processor may
potentially need to access all of B to form a single row of C. However, only a
portion of B is locally available at any time in parallel algorithms. The algorithm
thus performs multiple iterations to fully form one row of C. We use a SPA to
accumulate the nonzeros of the current active row of C. Algorithm 14.2 shows the
pseudocode of the algorithm. Loads and unloads of the SPA, which are not amortized
by the number of nonzero arithmetic operations in general, dominate the
computational time.

Algorithm 14.2. Matrix matrix multiply.

Operation C ← AB using block row Sparse1D.

C : R^P(S(N)×N) = Block1D-PSpGEMM(A : R^P(S(N)×N), B : R^P(S(N)×N))
    for all processors P(i) in parallel
        do Initialize(SPA)
           for j ← 1 to p
               do Broadcast(B_j)
                  for k ← 1 to N/p
                      do Load(SPA, C_i(k, :))
                         SPA ← SPA + A_ij(k, :) B_j
                         Unload(SPA, C_i(k, :))
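The loop structure of Algorithm 14.2 can be simulated sequentially to see why SPA traffic dominates. The sketch below is an illustration under assumptions: the p processors are simulated one after another, rows are stored as Python dictionaries, a dictionary also stands in for the SPA, N is divisible by p, and the name `block1d_spgemm` is hypothetical.

```python
def block1d_spgemm(A_rows, B_rows, p):
    """Sequential simulation of the block row 1D algorithm.

    A_rows[i] and B_rows[i] map column index -> value for row i.
    Returns (C_rows, loads), where loads counts SPA load/unload pairs.
    """
    N = len(A_rows)
    blk = N // p                       # rows per processor (assumes p divides N)
    C_rows = [dict() for _ in range(N)]
    loads = 0
    for proc in range(p):              # "for all processors P(i) in parallel"
        for j in range(p):             # stage j: block row B_j has been broadcast
            lo, hi = j * blk, (j + 1) * blk
            for k in range(proc * blk, (proc + 1) * blk):
                spa = dict(C_rows[k])  # Load(SPA, C_i(k, :))
                loads += 1
                for col, a in A_rows[k].items():
                    if lo <= col < hi:  # nonzero belongs to block A_ij
                        for c2, b in B_rows[col].items():
                            spa[c2] = spa.get(c2, 0.0) + a * b
                C_rows[k] = spa        # Unload(SPA, C_i(k, :))
    return C_rows, loads


A_rows = [{0: 1.0, 3: 2.0}, {1: 3.0}, {}, {2: 4.0, 0: 5.0}]
B_rows = [{1: 1.0}, {0: 2.0, 3: 1.0}, {2: 3.0}, {1: 4.0}]
C_rows, loads = block1d_spgemm(A_rows, B_rows, p=2)
```

Each simulated processor performs p stages over N/p rows, so it loads and unloads the SPA N times no matter how few multiplications its rows require. That per-processor count does not shrink as p grows.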



14.3.4 Sparse Cannon 

Our first 2D algorithm is based on Cannon's algorithm for dense matrices (see 
[Cannon 1969]). The pseudocode of the algorithm is given in Algorithm 14.5. Sparse 
Cannon, although elegant, is not our choice of algorithm for the final implementa- 
tion, as it is hard to generalize to nonsquare grids, nonsquare matrices, and matrices 
whose dimensions are not perfectly divisible by grid dimensions. 

Algorithm 14.3. Circular shift left.

Circularly shift left by s along the processor row.

Left-Circular-Shift(Local : R^S(N×N), s)
    Send(Local, P(i, (j − s) mod √p))       ▷ This is processor P(i, j)
    Receive(Temp, P(i, (j + s) mod √p))
    Local ← Temp

Algorithm 14.4. Circular shift up.

Circularly shift up by s along the processor column.

Up-Circular-Shift(Local : R^S(N×N), s)
    Send(Local, P((i − s) mod √p, j))       ▷ This is processor P(i, j)
    Receive(Temp, P((i + s) mod √p, j))
    Local ← Temp

Algorithm 14.5. Cannon matrix multiply.

Operation C ← AB using Sparse Cannon.

C : R^P(S(N×N)) = Cannon-PSpGEMM(A : R^P(S(N×N)), B : R^P(S(N×N)))
    for all processors P(i, j) in parallel
        do Left-Circular-Shift(A_ij, i − 1)
           Up-Circular-Shift(B_ij, j − 1)
    for all processors P(i, j) in parallel
        do for k ← 1 to √p
               do C_ij ← C_ij + A_ij B_ij
                  Left-Circular-Shift(A_ij, 1)
                  Up-Circular-Shift(B_ij, 1)
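The data movement in Algorithm 14.5 can be checked with a small sequential simulation. This is a sketch under assumptions: the q × q processor grid (q = √p) is a nested list, sends and receives are modeled by reindexing that list, blocks are dictionaries of (row, col) → value, indices are 0-based so the initial skew is by i and j rather than i − 1 and j − 1, and the local multiply is a plain hash-based product rather than Hypersparse_GEMM. The names `local_multiply_add` and `cannon_spgemm` are hypothetical.

```python
from collections import defaultdict


def local_multiply_add(C, A, B):
    """C += A * B for blocks stored as {(row, col): value} dicts."""
    rows_of_B = defaultdict(list)
    for (k, j), b in B.items():
        rows_of_B[k].append((j, b))
    for (i, k), a in A.items():
        for j, b in rows_of_B.get(k, ()):
            C[(i, j)] = C.get((i, j), 0.0) + a * b


def cannon_spgemm(A, B, q):
    """Simulate Sparse Cannon on a q x q grid; A[i][j], B[i][j] are blocks."""
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[{} for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                local_multiply_add(C[i][j], A[i][j], B[i][j])
        # Shift A left by one and B up by one: P(i, j) receives from its
        # right neighbor and from the neighbor below, respectively.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C


def blk(v):
    """A 1 x 1 block holding v (empty block if v is zero)."""
    return {(0, 0): v} if v else {}


A = [[blk(1.0), blk(2.0)], [blk(3.0), blk(4.0)]]
B = [[blk(5.0), blk(6.0)], [blk(7.0), blk(8.0)]]
C = cannon_spgemm(A, B, q=2)
```

After the skew, processor P(i, j) holds A_ik and B_kj for the same k, and every later shift advances k by one on all processors at once, so each step uses only nearest-neighbor communication.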



14.3.5 Sparse SUMMA 

SUMMA [Van De Geijn & Watts 1997] is a memory efficient, easy to generalize
algorithm for parallel dense matrix multiplication. It is the algorithm used in parallel
BLAS (see [Chtchelkanova et al. 1997]). As opposed to Cannon's algorithm, it
allows a tradeoff to be made between latency cost and memory by varying the degree
of blocking. The algorithm, illustrated in Figure 14.11, proceeds in k/b stages. At
each stage, √p active row processors broadcast b columns of A simultaneously along
their rows and √p active column processors broadcast b rows of B simultaneously
along their columns.

Sparse SUMMA is our algorithm of choice for our final implementation because
it is easy to generalize to nonsquare matrices and to matrices whose dimensions are
not perfectly divisible by grid dimensions.

Figure 14.11. Sparse SUMMA execution (b = N/√p).

14.4 Analysis of parallel algorithms

In this section, we analyze the parallel performance of our algorithms and show that
they scale better than existing 1D algorithms in theory. We begin by introducing
our parameters and model of computation. Then, we present a theoretical analysis
showing that 1D decomposition, at least with the current algorithm, is not sufficient
for PSpGEMM to scale. Finally, we analyze our 2D algorithms in depth.

In our analysis, the cost of one floating-point operation, along with the cost of
cache misses and memory indirections associated with the operation, is denoted by
γ, measured in nanoseconds. The latency of sending a message over the
communication interconnect is α, and the inverse bandwidth is β, measured in
nanoseconds and nanoseconds per word transferred, respectively. The running time of
a parallel algorithm on p processors is given by

\[
T_p = T_{comm} + T_{comp},
\]

where T_comm denotes the time spent in communication and T_comp is the time spent
during local computation phases. T_comm includes both the latency (delay) costs
and the actual time it takes to transfer the data words over the network. Hence,
the cost of transmitting h data words in a communication phase is

\[
T_{comm} = \alpha + h\beta.
\]

The sequential work of SpGEMM, unlike dense GEMM, depends on many
parameters. This makes parallel scalability analysis a tough process. Therefore,
we restrict our analysis to sparse matrices following the Erdős-Rényi graph model.
Consequently, the analysis is probabilistic, exploiting the independent and
identical distribution of nonzeros. When we talk about quantities such as nonzeros
per subcolumn, we mean the expected number of nonzeros. Our analysis assumes
that there are c > 0 nonzeros per row/column. The sparsity parameter c, albeit
oversimplifying, is useful for analysis purposes since it makes different parameters
comparable to each other. For example, if A and B both have sparsity c, then
nnz(A) = cN and flops(AB) = c²N. It also allows us to decouple the effects of
load imbalances from the algorithm analysis because the nonzeros are assumed to
be evenly distributed across processors.

The lower bound on sequential SpGEMM is Ω(flops) = Ω(c²N). This bound is
achieved by some rowwise and columnwise implementations (see [Gilbert et al. 1992,
Gustavson 1978]), provided that c > 1. The rowwise implementation of Gustavson
that uses CSR is the natural kernel to be used in the 1D algorithm where data
is distributed by rows. As shown in the previous chapter, it has an asymptotic
complexity of

\[
O(N + nnz(A) + flops) = O(N + cN + c^2 N) = \Theta(c^2 N).
\]

Therefore, we take the sequential work (W) to be γc²N in our analysis.

14.4.1 Scalability of the 1D algorithm

We begin with a theoretical analysis whose conclusion is that 1D decomposition is
not sufficient for PSpGEMM to scale. In Block1D_PSpGEMM, each processor
sends and receives p − 1 point-to-point messages of size nnz(B)/p. Therefore,

\[
T_{comm} = (p - 1)\left(\alpha + \beta\,\frac{nnz(B)}{p}\right) = \Theta(p\,\alpha + \beta c N). \tag{14.5}
\]

We previously showed that the Block1D_PSpGEMM algorithm is unscalable
with respect to both communication and computation costs (see, for instance,
[Buluç & Gilbert 2008a]). In fact, it gets slower as the number of processors grows.
The current Star-P implementation (see [Shah 2007]) bypasses this problem by
all-to-all broadcasting nonzeros of the B matrix, so that the whole B matrix is
essentially assembled at each processor. This avoids the cost of loading and unloading
SPA at every stage, but it uses nnz(B) memory at each processor.
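Equation (14.5) is easy to evaluate numerically. The sketch below is illustrative only: the function name `tcomm_1d` is hypothetical, and the sample values for N, c, α, and β are placeholders chosen for this example, with α in nanoseconds and β in nanoseconds per word.

```python
def tcomm_1d(p, N, c, alpha, beta):
    """Per-processor communication time of the block row 1D algorithm,
    T_comm = (p - 1) * (alpha + beta * nnz(B) / p), with nnz(B) = c * N."""
    nnz_B = c * N
    return (p - 1) * (alpha + beta * nnz_B / p)


# Placeholder parameters for illustration (not measurements from the text).
N, c, alpha, beta = 2**20, 8, 2300.0, 8.0
costs = [tcomm_1d(p, N, c, alpha, beta) for p in (1, 2, 64, 4096)]
```

The bandwidth part approaches βcN and the latency part grows linearly in p, so the per-processor communication time never decreases as processors are added. This is the sense in which the 1D algorithm does not scale.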



14.4.2 Scalability of the 2D algorithms

In this section, we provide an in-depth theoretical analysis of our parallel 2D
SpGEMM algorithms and conclude that they scale significantly better than their
1D counterparts. Although our analysis is limited to the Erdős-Rényi model, its
conclusions are strong enough to be convincing.

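In Cannon_PSpGEMM, each processor sends and receives √p − 1 point-to-point
messages of size nnz(A)/p, and √p − 1 messages of size nnz(B)/p. Therefore,
the communication cost per processor is

\[
T_{comm} = \sqrt{p}\left(2\alpha + \beta\,\frac{nnz(A) + nnz(B)}{p}\right)
         = \Theta\!\left(\alpha\sqrt{p} + \frac{\beta c N}{\sqrt{p}}\right). \tag{14.6}
\]

The average number of nonzeros in a column of a local submatrix A_ij is c/√p.
Therefore, for a submatrix multiplication A_ik B_kj,

\[
ni(A_{ik}, B_{kj}) = \min\left\{1, \frac{c^2}{p}\right\}\frac{N}{\sqrt{p}}
                   = \min\left\{\frac{N}{\sqrt{p}}, \frac{c^2 N}{p\sqrt{p}}\right\},
\]

\[
flops(A_{ik} B_{kj}) = \frac{flops(AB)}{p\sqrt{p}} = \frac{c^2 N}{p\sqrt{p}},
\]

\[
T_{mult} = \sqrt{p}\left(2\min\left\{1, \frac{c}{\sqrt{p}}\right\}\frac{N}{\sqrt{p}}
         + \frac{c^2 N}{p\sqrt{p}}\,\lg\left(\min\left\{\frac{N}{\sqrt{p}}, \frac{c^2 N}{p\sqrt{p}}\right\}\right)\right).
\]

The probability of a single column of A_ik (or a single row of B_kj) having at
least one nonzero is min{1, c/√p}, where 1 covers the case p ≤ c² and c/√p covers
the case p > c².

The overall cost of additions, using p processors and Brown and Tarjan's
O(m lg(n/m)) algorithm [Brown & Tarjan 1979] for merging two sorted lists of size
m and n (for m < n), is

\[
T_{add} = \sum_{i=1}^{\sqrt{p}} \frac{flops}{p\sqrt{p}}\,\lg i
        = \frac{flops}{p\sqrt{p}} \sum_{i=1}^{\sqrt{p}} \lg i
        = \frac{flops}{p\sqrt{p}}\,\lg\left(\sqrt{p}\,!\right).
\]

Note that we might be slightly overestimating since we assume flops/nnz(C) ≈
1 for simplicity. From Stirling's approximation and asymptotic analysis, we know
that lg(n!) = Θ(n lg n) [Cormen et al. 2001]. Thus, we get

\[
T_{add} = \Theta\!\left(\frac{flops}{p\sqrt{p}}\,\sqrt{p}\,\lg\sqrt{p}\right)
        = \Theta\!\left(\frac{c^2 N \lg\sqrt{p}}{p}\right).
\]

There are two cases to analyze: p > c² and p ≤ c². Since scalability analysis
is concerned with the asymptotic behavior as p increases, we just provide results
for the p > c² case. The total computation cost T_comp = T_mult + T_add, weighted
by the cost γ per operation, is

\[
T_{comp} = \gamma\left(\frac{cN}{\sqrt{p}} + \frac{c^2 N}{p}\,\lg\left(\frac{c^2 N}{p\sqrt{p}}\right)
         + \frac{c^2 N \lg\sqrt{p}}{p}\right)
         = \gamma\left(\frac{cN}{\sqrt{p}} + \frac{c^2 N}{p}\,\lg\left(\frac{c^2 N}{p}\right)\right). \tag{14.7}
\]

In this case, parallel efficiency is

\[
E = \frac{W}{p\,(T_{comp} + T_{comm})}
  = \frac{\gamma c^2 N}{(\gamma + \beta)\,cN\sqrt{p} + \gamma c^2 N \lg\left(\frac{c^2 N}{p}\right) + \alpha\,p\sqrt{p}}. \tag{14.8}
\]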

Scalability is not perfect and efficiency deteriorates as p increases due to the
first term. Speedup is, however, not bounded, as opposed to the 1D case. In
particular, lg(c²N/p) becomes negligible as p increases, and scalability due to latency
is achieved when γc²N ∝ αp√p, where it is sufficient for N to grow on the order
of p^1.5. The biggest bottleneck for scalability is the first term in the denominator,
which scales with √p. Consequently, two different scaling regimes are likely to be
present: a close to linear scaling regime until the first term starts to dominate the
denominator and a √p-scaling regime afterwards.
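The efficiency expression discussed above, E = γc²N / ((γ + β)cN√p + γc²N lg(c²N/p) + αp√p), can be evaluated directly to see the two regimes. The sketch below is an illustration under assumptions: the function name `efficiency_2d` is hypothetical, the formula is used only in the p > c² case with constants hidden by the asymptotic notation set to one, and the sample parameter values are placeholders (γ and α in nanoseconds, β in nanoseconds per word).

```python
import math


def efficiency_2d(p, N, c, alpha, beta, gamma):
    """Modeled parallel efficiency of Sparse Cannon for p > c^2."""
    if p <= c * c:
        raise ValueError("this form of the model assumes p > c^2")
    sp = math.sqrt(p)
    work = gamma * c * c * N                        # sequential work W
    denom = ((gamma + beta) * c * N * sp            # bandwidth + column scans
             + work * math.log2(c * c * N / p)      # heap merge, lg(c^2 N / p)
             + alpha * p * sp)                      # latency
    return work / denom


# Placeholder parameters for illustration only.
N, c, alpha, beta, gamma = 2**20, 8, 2300.0, 8.0, 19.2
procs = (256, 1024, 4096)
eff = [efficiency_2d(p, N, c, alpha, beta, gamma) for p in procs]
speedup = [p * e for p, e in zip(procs, eff)]
```

Multiplying the efficiency by p gives the modeled speedup, which keeps growing with p even while efficiency itself stays well below one. This matches the claim that speedup is unbounded but scalability is not perfect.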

Compared to the 1D algorithms, Sparse Cannon both lowers the degree of
unscalability due to bandwidth costs and mitigates the bottleneck of computation.
This makes overlapping communication with computation more promising.

Sparse SUMMA, like dense SUMMA, incurs an extra cost over Cannon for
using rowwise and columnwise broadcasts instead of nearest-neighbor
communication, which might be modeled as an additional O(lg p) factor in
communication cost. Other than that, the analysis is similar to Sparse Cannon and we omit
the details. Using the DCSC data structure, the expected cost of fetching b
consecutive columns of a matrix A is b plus the size (number of nonzeros) of the
output [Buluç & Gilbert 2008b]. Therefore, the algorithm asymptotically has the
same computation cost for all values of b.



14.5 Performance modeling of parallel algorithms 

In this section, we project the estimated speedup of 1D and 2D algorithms in order
to evaluate their prospects in practice. We use a quasi-analytical performance model
in which we first obtain realistic values for the parameters (γ, β, α) of the algorithm
performance, then use them in our projections.

In order to obtain a realistic value for γ, we performed multiple runs on an
AMD Opteron 8214 (Santa Rosa) processor using matrices of various dimensions
and sparsity, estimating the constants using nonlinear regression. One surprising
result is the order of magnitude difference in the constants between sequential kernels.
The classical algorithm, which is used as the 1D SpGEMM kernel, has γ = 293.6
nsec, whereas Hypersparse_GEMM, which is used as the 2D kernel, has γ = 19.2
nsec. We attribute the difference to cache friendliness of the hypersparse algorithm.
The interconnect supports 1/β = 1 GB/sec point-to-point bandwidth and a
maximum of α = 2.3 microseconds latency, both of which are achievable on TACC's
Ranger Cluster. The communication parameters ignore network contention.

Figure 14.12. Modeled speedup of synchronous sparse 1D algorithm.

Figures 14.12 and 14.13 show the modeled speedup of Block1D_PSpGEMM 
and Cannon_PSpGEMM for matrix dimensions from N — 2^"^ to 2^"' and number 
of processors from p = 1 to 4096. The inputs are Erdős-Rényi graphs.

We see that Block1D_PSpGEMM's speedup does not go beyond 50x, even 
on larger matrices. For relatively small matrices, having dimensions N — 2^^ — 2^°, 
it starts slowing down after a thousand processors, where it achieves less than 40x 
speedup. On the other hand, Cannon_PSpGEMM shows increasing and almost 
linear speedup for up to 4096 processors even though the slope of the curve is less 
than one. It is crucial to note that the projections for the 1D algorithm are based
on the memory inefficient implementation that performs an all-to-all broadcast of
B. This is because the original memory efficient algorithm given in Section 14.3.3
actually slows down as p increases.

It is worth explaining one peculiarity. The modeled speedup turns out to
be higher for smaller matrices than for bigger matrices. Remember that
communication requirements are on the same order as computational requirements for
parallel SpGEMM. Intuitively, the speedup should be independent of the matrix
dimension in the absence of load imbalance and network contention, but since we
are estimating the speedup with respect to the optimal sequential algorithm, the
overheads associated with the hypersparse algorithm are bigger for larger
matrices. The bigger the matrix dimension, the slower the hypersparse algorithm is with
respect to the optimal algorithm, due to the extra logarithmic factor. Therefore,
speedup is better for smaller matrices in theory. This is not the case in practice
because the peak bandwidth is usually not achieved for small-sized data transfers
and load imbalances are more severe for smaller matrices.

Figure 14.13. Modeled speedup of synchronous Sparse Cannon.

We also evaluate the effects of overlapping communication with computation.
Following Krishnan and Nieplocha [Krishnan & Nieplocha 2004], we define
the nonoverlapped percentage of communication as

\[
w = 1 - \frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}
  = \frac{T_{\mathrm{comm}} - T_{\mathrm{comp}}}{T_{\mathrm{comm}}}.
\]

The speedup of the asynchronous implementation is

\[
S = \frac{W}{T_{\mathrm{comp}} + w \, T_{\mathrm{comm}}}.
\]

Figure 14.14 shows the modeled speedup of asynchronous SpCannon assuming 
truly one-sided communication. For smaller matrices with dimensions N = 2^^— 2^*^, 
speedup is about 25% more than the speedup of the synchronous implementation. 

The modeled speedup plots should be interpreted as upper bounds on the 
speedup that can be achieved on a real system using these algorithms. Achieving 
these speedups on real systems requires all components to be implemented and 
working optimally. The conclusion we derive from those plots is that no matter how 
hard we try, it is impossible to get good speedup with the current 1D algorithms.

Figure 14.14. Modeled speedup of asynchronous Sparse Cannon.

Chapter 15

Parallel Mapping of Sparse
Computations

Abstract

Expressing graph algorithms in the language of linear algebra aids strongly 
in automatically mapping algorithms onto parallel architectures. Ma- 
trices extend naturally to parallel mapping schemes. They allow for 
schemes not strictly based around the source or destination vertex of 
the edge, but rather the pairwise combination of the two. The problem 
of mapping over sparse data, especially data distributed according to 
a power law, is difficult. Automated techniques are best for achieving 
optimal parallel throughput. 

15.1 Introduction 

Previous chapters have defined many common data layouts, or maps, used for paral- 
lel arrays. These maps are most appropriate for processing dense data. Sparse data, 
especially sparse power law data, requires more complex data mapping techniques. 

For sparse data, the maps yielding high performance are intricate and
nonintuitive, making them next to impossible for a human to discover. Figure 15.1
shows the scaling as the number of processors increases for an edge betweenness
centrality algorithm that uses a sparse adjacency matrix. The maps in this case
were manually chosen on the basis of algorithm analysis. While the maps do yield
initial performance benefits, the benefit quickly flattens out after around 8 to 16
processors are added. The actual performance falls well short of the desired
performance for the algorithm.

Figure 15.1. Performance scaling.

Scaling of edge betweenness centrality algorithm on a cluster. The
algorithm's peak performance is obtained after using only 32 machines.

While it is impossible to find an optimal data layout for many sparse prob- 
lems in a tractable amount of time, a good layout can commonly be found using 
automated search techniques. The computational cost of finding an ideal map for a 
particular set of inputs to a program is typically high. In order for data mapping to 
be beneficial, the reward for finding a good map must be correspondingly high. In 
general, the benefit of data mapping can be seen for two different types of problems: 

• problems where at least one input is used repeatedly, and 

• problems where input content may vary, but data layout remains constant. 

In both of these cases, the cost of mapping is amortized over multiple calls to the 
mapped program. This, in many cases, allows the performance gains to outweigh 
the upfront cost for optimization. 

Automated mapping can be done in a variety of ways. Oftentimes the process 
is not completely automated, but may have a human in the loop to point the 
mapping program in the correct direction. The rest of this chapter presents a 
detailed description of one software approach to this problem. 

15.2 Lincoln Laboratory mapping and optimization 
environment 

Linear algebra operations have long been in use in front-end processing for
sensor processing applications. There is an increasing demand for linear algebra in
back-end processing, where graph applications dominate. Many graph algorithms
have linear algebraic equivalents using a sparse adjacency matrix. These algorithms,
such as breadth-first search, clustering, and centrality, form the backbone of many 
typical back-end applications. By leveraging this representation, it becomes possi- 
ble to process both front- and back-end data in the same manner, using algebraic 
operations on matrices. 

With growing demands on the performance of sensor applications, parallel 
implementations of these algorithms become necessary. For front-end processing, 
which typically operates on dense matrices, this yields large rewards for very little 
effort. Parallel strategies for dense matrix operations are well studied and yield good 
performance results. However, there is no clear method for efficiently parallelizing 
the sparse matrix operations typical to back-end processing. 

The difference between sparse and dense algorithm efficiency arises due to 
differences in their data-access patterns. While dense matrix algorithms naturally 
access large continuous chunks of data, allowing them to take advantage of stream- 
ing access, this is not true for sparse matrices. Naive sparse matrix algorithms 
typically require random access, which, on modern architectures, incurs a high 
cost. This cost is more pronounced for parallel processing, where remote random 
access requires use of both the memory and the network. Sparse matrix algorithms 
typically underperform their dense matrix counterparts in terms of op/s (operations 
per second) due to the large memory and network latencies they incur. 

15.2.1 LLMOE overview 

To overcome memory and network latencies, the software-hardware interaction must 
be co-optimized. The Lincoln Laboratory Mapping and Optimization Environment 
(LLMOE) simplifies the co-optimization process by abstracting both the algorithm 
and hardware implementations and provides tools for analyzing both. Our system, 
implemented in Matlab using pMatlab [Bliss & Kepner 2007] for parallelization, is
shown in Figure 15.2. This figure illustrates the four main components of LLMOE. 

• The program analysis component is responsible for converting the user pro- 
gram, taken as input, into a parse graph, a description of the high-level oper- 
ations and their dependencies on one another. 

• The data mapping component is responsible for distributing the data of each 
variable specified in the user code across the processors in the architecture. 

• The operations analysis component is responsible for taking the parse graph
and data maps and forming the dependency graph, a description of the
low-level operations and their dependencies on one another.

• The architecture simulation component is responsible for taking the depen- 
dency graph and a model of a hardware architecture and simulating it on 
that architecture. Once the simulation is finished, the results can either be 
returned to the user or fed back into the data mapping component in order 
to further optimize the data distribution. 

Figure 15.2. LLMOE.

Overview of the LLMOE framework and its components.

LLMOE's modular design allows for various hardware and software
implementations to be evaluated relative to one another. The data mapping and architecture
simulation are designed to be pluggable components to provide flexibility. This
design lets users determine the best software-hardware pairing for their application. In
addition, individual operator implementations (e.g., matrix multiplication) can be
analyzed because LLMOE supports the definitions of operations at the dependency
graph level. LLMOE provides a standard interface in the program and operations
analysis components via the parse graph and dependency graph, respectively.

Currently, LLMOE supports three data mappers. These mappers typically 
specify maps using a PITFALLS data structure [Ramaswamy & Banerjee 1995],
though other formats are possible. The manual mapper accepts fully specified data 
maps from the user. The genetic mapper uses a genetic algorithm and the feedback 
from the architecture simulation to optimize the map selection. Finally, the atlas 
mapper looks up the appropriate data maps for certain types of operations and 
sparsity patterns in a predefined atlas of maps. 

In addition, LLMOE currently defines two architecture simulators. The in- 
put to each of these simulators includes the dependency graph and an architectural 
model that considers memory and network bandwidth/latency, CPU rate, and net- 
work topology. The topological simulator partitions the dependency graph into 
stages based on the dependencies and then executes each stage sequentially, with 
the instructions in the stage being executed in parallel. The event-driven simulator 
uses PhoenixSim [Chan et al. 2010], an event-driven simulator primarily for pho- 
tonic networks built on top of OMNeT [P()n,2,('i], to simulate the dependency graph 
operations. An operation is scheduled as soon as all of its dependencies have been 
fulfilled. 
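
The staging performed by the topological simulator can be sketched as follows. This is a minimal illustration in Python, not LLMOE's Matlab implementation. The function name `partition_into_stages` and the dictionary encoding of the dependency graph are assumptions made for the example.

```python
def partition_into_stages(deps):
    """Group operations into stages whose members can run in parallel.

    deps maps each operation name to the list of operations it depends on
    (a hypothetical encoding of the dependency graph). Stage k holds every
    operation whose dependencies all lie in stages 0..k-1.
    """
    remaining = {op: set(d) for op, d in deps.items()}
    # Operations that appear only as dependencies have no prerequisites.
    for d in deps.values():
        for op in d:
            remaining.setdefault(op, set())

    stages = []
    done = set()
    while remaining:
        ready = sorted(op for op, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        stages.append(ready)
        done.update(ready)
        for op in ready:
            del remaining[op]
    return stages


# A tiny distributed addition: load both operands, move B, add, store.
example = {
    "send_B": ["load_B"],
    "add": ["load_A", "send_B"],
    "store_C": ["add"],
}
stages = partition_into_stages(example)
```

Each inner list is one stage. The stages run one after another, and the operations inside a stage are treated as concurrent. An event-driven simulator would instead start each operation as soon as its own dependencies complete, without waiting for the rest of its stage.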

The LLMOE system eases the efforts involved in co-optimization of matrix 
operations, of key importance in the intelligence, surveillance, and reconnaissance 
(ISR) domain. It uses a front-end Matlab interface to allow users to input their 
application in a language familiar to them. In addition to the pluggable components 
for data mapping and architecture simulation, the user may also specify new low- 
level operation implementations. The system is itself parallel, leveraging pMatlab 
to speed up mapping and simulation time. 

15.2.2 Mapping in LLMOE 

In order to find the best mapping for a sparse matrix multiplication operation, the 
efficiency of computation and communication at a fine-grain level are co-optimized. 
Note that while data maps provide the distribution information for the matrices, 
they do not provide routing information. Once the maps for the arrays are defined, 
the set of communication operations that must occur can be enumerated. However, 
in order to evaluate the performance of a given mapping, a route for each com- 
munication operation must be chosen. These routes cannot be enumerated until 
a map is defined. Figure 15.3 illustrates a simplified mapping example over a dis- 
tributed addition. In the figure, the dependency graph communication operations 
are highlighted. For each highlighted operation, a number of routing options
exist. The number of possible routes is dependent on the topology of the underlying
hardware architecture. Since the best map depends on chosen routes and routes
cannot be enumerated until a map is chosen, the problem is combinatorial and no
closed form solution exists. This type of problem can be solved by a stochastic
search method. Evaluation of the quality of a solution requires creation and
simulation of fine-grained dependency graphs (Figure 15.3) on a machine or hardware
model. A stochastic search technique well suited for parallelization is chosen. The
next section describes the genetic algorithm [Mitchell 1998] problem formulation in
greater detail.

Figure 15.3. Parallel addition with redistribution.

Top: illustrates how a parallel addition is mapped. Bottom: presents the
dependency graph for the addition. For each communication operation
(highlighted) a number of possible routes exist.



Co-optimization of mapping and routing 

Figure 15.4 presents an overview of the nested genetic algorithm (GA) for mapping
and routing.

Figure 15.4. Nested genetic algorithm (GA).

The outer GA searches over maps, while the inner GA searches over 
routes for all communication operations given a map set. 



The outer GA searches over maps, while the inner GA searches over route 
options for a given set of maps. One input to the GA is a parse graph. The outer 
GA generates a population of maps for the arrays in the parse graph. For the 
matrix-multiply operation A = A * B, there are two arrays for which maps must be
generated, A and B. The result, in this case, is declared to have the same mapping 
as the left operand. For each <maps, parse graph> pair, a dependency graph is 
generated. While the parse graph contains operations such as matrix multiplication 
and assignment, the dependency graph contains only computation, communication, 
and memory operations, as illustrated in Figure 15.3 (bottom). 

Once the dependency graph is constructed, the communication operations 
and corresponding route choices can be enumerated. At this point, the inner GA 
assigns route choices to communication operations and evaluates the performance 
of the given <maps, routes> pair against a hardware model. Fitness evaluation is 
performed using opportunistic scheduling, and operations in the dependency graph 
are overlapped whenever possible. 

Outer GA 

The outer GA iterates over sets of maps for arrays in the computation. The
minimum block size for a matrix is chosen based on the size of the matrix being mapped.
An individual in the outer GA is represented as a set of matrices blocked
according to minimum block size. Processors are assigned to each block, and mutation
and crossover operators manipulate the processor assignments for each block.
Figure 15.5 illustrates a typical individual for a matrix multiplication operation.

Figure 15.5. Outer GA individual.

Here, there are 36 blocks in each of the two matrices in the individual.
Different shades of gray indicate different processors assigned to blocks.

Figure 15.6. Inner GA individual.

The set of communication operations is represented as a linear array,
with each entry in the array containing the index of a route chosen for
the given communication operation.

Inner GA 

The inner GA iterates over routes. The representation for the inner GA is simply 
the list of communication operations. The length of the list is equal to the number of 
communication operations for a given set of maps. Each entry in the list represents 
the index of a route chosen for a particular communication operation. Figure 15.6 
illustrates an individual for the inner GA. 
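
The two individual representations described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the LLMOE code. All function names are hypothetical, and the specific operators shown (per-block random reassignment and single-point crossover) are assumed for the example, since the text does not say which mutation and crossover operators LLMOE uses.

```python
import random


def random_outer_individual(num_matrices, blocks_per_matrix, num_procs, rng):
    """Outer GA individual: a processor index for every block of every matrix."""
    return [[rng.randrange(num_procs) for _ in range(blocks_per_matrix)]
            for _ in range(num_matrices)]


def mutate(individual, num_procs, rate, rng):
    """Reassign each block to a random processor with probability `rate`."""
    return [[rng.randrange(num_procs) if rng.random() < rate else proc
             for proc in matrix]
            for matrix in individual]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover applied independently to each matrix's blocks."""
    child = []
    for blocks_a, blocks_b in zip(parent_a, parent_b):
        cut = rng.randrange(1, len(blocks_a))
        child.append(blocks_a[:cut] + blocks_b[cut:])
    return child


def random_inner_individual(num_comm_ops, routes_per_op, rng):
    """Inner GA individual: one route index per communication operation."""
    return [rng.randrange(routes_per_op) for _ in range(num_comm_ops)]


rng = random.Random(0)
# Two matrices (A and B) of 36 blocks each, mapped onto 8 processors.
a = random_outer_individual(2, 36, 8, rng)
b = random_outer_individual(2, 36, 8, rng)
child = mutate(crossover(a, b, rng), num_procs=8, rate=0.05, rng=rng)
routes = random_inner_individual(num_comm_ops=100, routes_per_op=2, rng=rng)
```

The length of an inner individual is only known once an outer individual has been fixed, because the set of communication operations depends on the maps. That dependence is what makes the search nested.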

Search space 

When one is performing a stochastic search, it is helpful to characterize the size 
of the search space. The search space, S, for the nested GA formulation of the 
mapping and routing problem is given by 

\[
S = P^{B} \, r^{C}
\]

where

P = number of processors

B = number of blocks

C = number of communication operations

r = average number of route options per communication operation.

Table 15.1. Individual fitness evaluation times.

Time to evaluate for some sample sparsity patterns of 1024 x 1024
matrices. The time shown is the average over 30,000 evaluations.

Sparsity pattern    Evaluation time (min)
toroidal            0.45
power law           1.77


Consider an architecture with 32 processors, where for any given pair of pro- 
cessors there are four possible routes between them. For a mapping scheme involving 
just 64 blocks, or two blocks owned by each processor, and 128 communication op- 
erations, or four communication operations performed by each processor, the search 
space size is already extremely large, greater than 2 x 10^173.
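
The arithmetic behind this estimate can be checked directly. The sketch below evaluates the search space formula with exact integer arithmetic. The function names are hypothetical and chosen only for this example.

```python
import math


def search_space_size(procs, blocks, comm_ops, routes_per_op):
    """Return S = P^B * r^C as an exact Python integer."""
    return procs ** blocks * routes_per_op ** comm_ops


def order_of_magnitude(n):
    """Return the power of ten of a positive integer, however large."""
    return math.floor(math.log10(n))


# 32 processors, 64 blocks, 128 communication operations, 4 routes each.
s_example = search_space_size(32, 64, 128, 4)
# 8 processors, 128 blocks, 100 communication operations, 2 routes each.
s_ring = search_space_size(8, 128, 100, 2)
```

Because 32 = 2^5 and 4 = 2^2, the first case is exactly 2^576, which is on the order of 10^173. The second set of parameters is the 8-processor ring configuration used later in the chapter's run statistics, and gives 2^484, on the order of 10^145.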

Parallelization of the nested GA 

Fitness evaluation of a <maps, routes> pair requires building a dependency graph 
consisting of all communication, memory, and computation operations, performing 
opportunistic scheduling of operations, and simulating the operations on a machine 
model. This evaluation is computationally expensive, as illustrated by Table 15.1. 

Since the mapping and optimization environment is written in Matlab, we 
used pMatlab to parallelize the nested GA and run it on LLGrid: Lincoln Labora- 
tory cluster computing capability [Reuther et al. 2007]. Figure 15.7 illustrates the 
parallelization process. 

As indicated by Figure 15.7, parallelization required minimal changes to the 
code (Table 15.2). A GA is well suited to parallelization since each fitness evaluation 
can be performed independently of all other fitness evaluations. 
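
The leader and worker structure of this parallel GA can be sketched as below. This is a Python analogue written for illustration, not the pMatlab code. The names `run_ga`, `fitness`, `init_population`, and `breed` are hypothetical, and the convention that lower fitness is better is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_ga(fitness, init_population, breed, generations, workers=4):
    """Evaluate fitness in parallel; select and recombine on the leader.

    fitness, init_population, and breed are caller-supplied callables.
    Returns the best (score, individual) pair seen, or None if no
    generation was run.
    """
    population = init_population()
    best = None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Distribute: each fitness evaluation is independent of the others.
            scores = list(pool.map(fitness, population))
            # Aggregate on the leader and remember the best individual so far.
            ranked = sorted(zip(scores, population), key=lambda pair: pair[0])
            if best is None or ranked[0][0] < best[0]:
                best = ranked[0]
            # Selection and recombination happen on the leader only.
            population = breed([individual for _, individual in ranked])
    return best
```

Only the `pool.map` call is parallel, which mirrors the small code change reported in Table 15.2. A thread pool keeps the sketch simple. For a CPU-bound fitness function in Python, a process pool such as `concurrent.futures.ProcessPoolExecutor` would be the more likely choice.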

15.2.3 Mapping performance results 

This section discusses the performance results of the maps found by using LLMOE. 
It shows that LLMOE assists in finding efficient ways of distributing sparse com- 
putations onto parallel architectures and gaining insight into the type of mappings 
that perform well. 

Machine model 

The results presented here are simulated results on a hardware or machine model. 
LLMOE allows for the machine model to be altered freely. With this, focus may 
be placed on various architectural properties that affect the performance of sparse 
computations. Table 15.3 describes the parameters of the model used for the results 
presented. 

[Flowchart: (1) initialize a population of size N on processor 0; (2) distribute;
(3) perform parallel fitness evaluation on N/np individuals; (4) aggregate;
(5) perform selection and recombination on processor 0; (6) repeat steps 2
through 5 for M generations.]

Figure 15.7. Parallelization process.

Fitness evaluation is performed in parallel on np processors. Selection
and recombination are performed on the leader processor.



Table 15.2. Lines of code.

Parallelization with pMatlab requires minimal changes to the code.

Serial program      1400
Parallel program    1420
% Increase          1.4



Table 15.3. Machine model parameters.

Parameter          Value
Topology           ring
Processors         8
CPU rate           28 GFLOPS
CPU efficiency     30%
Memory rate        256 GBytes/sec
Memory latency     10^-8 sec
Network rate       256 GBytes/sec
Network latency    10^-8 sec
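
[Diagram: C = A * B, where A is N x M and B is M x K, formed as the sum over
i = 1, ..., M of the outer product of column i of A with row i of B.]

    C = zeros(N, K);                 % accumulator for the N x K result
    for i = 1:M
        C = C + A(:,i) * B(i,:);     % rank-1 update: column i of A times row i of B
    end

Figure 15.8. Outer product matrix multiplication.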
Figure 15.9. Sparsity patterns.



Matrix multiplication algorithm 

LLMOE was applied to an outer product matrix multiplication algorithm (see
[Golub & Van Loan 1996]). Figure 15.8 illustrates the algorithm and the
corresponding pseudocode. This algorithm was chosen because of the independent
computation of slices of matrix C. This property makes the algorithm well suited for
parallelization.
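
For readers who want to run the algorithm outside Matlab, the same outer product formulation can be written in plain Python. This is a minimal sketch on dense lists of rows. The function name is hypothetical, and the zero test only hints at how a sparse kernel would skip empty entries.

```python
def outer_product_matmul(A, B):
    """Compute C = A * B as a sum of M rank-1 (outer product) updates.

    A is N x M and B is M x K, both given as lists of rows.
    """
    n, m, k = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("inner dimensions of A and B must agree")
    C = [[0] * k for _ in range(n)]
    for i in range(m):                      # one outer product per step
        col = [A[r][i] for r in range(n)]   # column i of A
        row = B[i]                          # row i of B
        for r in range(n):
            if col[r] == 0:
                continue                    # nothing to add for a zero entry
            for c in range(k):
                C[r][c] += col[r] * row[c]
    return C


C = outer_product_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each step of the outer loop touches only column i of A and row i of B, so the data needed for one rank-1 update is determined entirely by how those two slices are mapped to processors.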



Sparsity patterns 

LLMOE solutions should apply to general sparse matrices so LLMOE was tested on 
a number of different sparsity patterns. Figure 15.9 illustrates the sparsity patterns 
mapped in increasing order of load-balancing complexity, from random sparse to 
scrambled power law. 

[Panels: 1D block, 2D block, 2D cyclic, anti-diagonal.]

Figure 15.10. Benchmark maps.

We compared our results with the results using standard mappings.

Figure 15.11. Mapping performance results.



Benchmarks 

Figure 15.10 illustrates a number of standard mappings that the results obtained 
with LLMOE were compared against. 
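
As a rough guide to what the four benchmark maps do, the sketch below assigns an owner processor to each block (i, j) of an n x n grid of blocks. These are common textbook conventions and are assumptions here, since the figure itself is not reproduced. In particular, the anti-diagonal rule (i + j) mod p is one plausible definition, and all function names are hypothetical.

```python
def block_1d(i, j, n, p):
    """1D block: contiguous bands of block rows go to successive processors."""
    return i * p // n


def block_2d(i, j, n, pr, pc):
    """2D block: contiguous tiles on a pr x pc processor grid."""
    return (i * pr // n) * pc + (j * pc // n)


def cyclic_2d(i, j, pr, pc):
    """2D cyclic: blocks are dealt out round-robin in both dimensions."""
    return (i % pr) * pc + (j % pc)


def anti_diagonal(i, j, p):
    """Anti-diagonal: blocks with the same i + j share a processor."""
    return (i + j) % p


def owner_grid(n, owner):
    """Tabulate owner(i, j) for every block of an n x n grid of blocks."""
    return [[owner(i, j) for j in range(n)] for i in range(n)]


grid = owner_grid(4, lambda i, j: anti_diagonal(i, j, 4))
```

Maps found by a search are not restricted to regular patterns like these. An outer GA individual may assign any processor to any block.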

Results 

Figure 15.11 presents performance results achieved by LLMOE. The maps found by
LLMOE outperform standard maps by more than an order of magnitude. The results are
normalized with regard to the performance achieved using a 2D block-cyclic map, 
as that is the most commonly used map for sparse computations. In order to show 
that the results were repeatable and statistically significant over a number of runs 
of the GA, multiple runs were performed. Figure 15.12 shows statistics for 30 runs 
of the GA on a power law matrix. Observe that there is good consistency in terms 
of solution found between multiple runs of the mapping framework. 

Figure 15.12. Run statistics.

The top plot shows best overall fitness found for each generation. The
middle plot shows average fitness for each generation. Finally, the
bottom plot shows the behavior of the best of the 30 runs over 30
generations.

Note that while a large number of possible solutions were considered, only
a small fraction of the search space has been explored. For the statistics runs in
Figure 15.12, the outer GA was run for 30 generations with 1000 individuals. The
inner GA used a greedy heuristic to pick the shortest route between two nodes
whenever possible. Thus, the total number of solutions considered was

\[
30 \times 1000 = 30{,}000.
\]

The size of the search space, using the expression for S given earlier, is

\[
S = P^{B} \, r^{C} = 8^{128} \times 2^{100} \approx 5 \times 10^{145},
\]

where 8 is the number of processors in the machine model; 128 is the number of
blocks used for 256 x 256 matrices; O(100) is the number of communication
operations; 2 is the number of possible routes for each communication operation given a
ring topology. Thus, the GA performs well in this optimization space, as it is able
to find good solutions while exploring a rather insignificant fraction of the search
space.
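
The two route options on a ring, and the greedy choice between them, can be sketched as follows. This is an illustrative Python sketch. The function names and the tie-breaking rule are assumptions, not details taken from LLMOE.

```python
def ring_routes(src, dst, num_procs):
    """Return the two routes from src to dst on a ring, as lists of nodes."""
    clockwise = [(src + step) % num_procs
                 for step in range((dst - src) % num_procs + 1)]
    counter = [(src - step) % num_procs
               for step in range((src - dst) % num_procs + 1)]
    return [clockwise, counter]


def greedy_route(src, dst, num_procs):
    """Pick the route with the fewest hops; ties go to the clockwise route."""
    return min(ring_routes(src, dst, num_procs), key=len)


short = greedy_route(0, 6, 8)   # going backwards around the ring is shorter
tied = greedy_route(0, 4, 8)    # both directions take four hops
```

Fixing the route this way removes the inner search entirely, which is why the number of solutions considered equals the number of outer GA evaluations.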

Conclusion

LLMOE provides a tool to analyze and co-optimize problems requiring the
partitioning of sparse arrays and graphs. It allows an efficient partition of the data used
in these problems to be obtained in a reasonable amount of time. This is possible 
even in circumstances where the search space of potential mappings is so large as 
to make the problem unapproachable by typical methods. 

Abstract

Graphs are a general approach for representing information that spans
the widest possible range of computing applications. They are partic- 
ularly important to computational biology, web search, and knowledge 
discovery. As the sizes of graphs increase, the need to apply advanced 
mathematical and computational techniques to solve these problems is 
growing dramatically. Examining the mathematical and computational 
foundations of the analysis of large graphs generally leads to more ques- 
tions than answers. This book concludes with a discussion of some of 
these questions. 

Of the many questions relating to the mathematical and computational
foundations of the analysis of large graphs, five important classes appear to emerge. 
These are: 

• Ontology, schema, data model. 

• Time evolution. 

• Detection theory. 

• Algorithm scaling. 

• Computer architecture. 

These questions are discussed in greater detail in the subsequent sections. 

16.1 Ontology, schema, data model 

Graphs are a highly general structure that can describe complex relationships 
(edges) between entities (vertices). It would appear self-evident that graphs are 
a good match for many important problems in computational biology, web search, 
and knowledge discovery. However, these problems contain far more information 
than just vertices and edges. Vertices and edges contain metadata describing their 
inherent properties. Incorporating vertex/edge metadata is critical to analyzing 
these graphs. Furthermore, the diversity of vertex and edge types often makes it 
unclear which is which. 

Knowledge representation using graphs has emerged, faded, and re-emerged 
over time. The recent re-emergence is due to the increased interest in very large 
networks and the data they contain. Revisiting the first order logical basis for 
knowledge representation using graphs, and finding efficient representations and
algorithms for querying graph-based knowledge representation databases is a fun- 
damental question. 

The mapping of the data for a specific problem onto a data structure for that 
problem is given by many names: ontology, schema, data model, etc. Historically, 
ontologies have been generated by experts by hand and applied to the data. Increas- 
ingly large, complex, and dynamic data sets make this approach infeasible. Thus, 
a fundamental question is how to create ontologies from data sets automatically. 

Higher order graphs (complexes, hierarchical, and hypergraphs) that can cap- 
ture more sophisticated relationships between entities may allow for more useful 
ontologies. However, higher order graphs raise a number of additional questions. 
How do we extend graph algorithms to higher order graphs? What are the perfor- 
mance benefits of higher order graphs on specific applications (e.g., pattern detec- 
tion or matching)? What is the computational complexity of algorithms running on 
higher order graphs? What approximations can be used to reduce the complexities 
introduced by higher order graphs? 

16.2 Time evolution 

Time evolution is an important feature of graphs. In what ways are the
spatiotemporal graphs arising from informatics and analytics problems similar to or different
from the more static graphs used in traditional scientific computing? Are higher
order graphs necessary to capture the behavior of graphs as they grow?

For static graphs, there is a rich set of basic algorithms for connectivity (e.g., 
s-t connectivity, shortest paths, spanning tree, connected components, biconnected
components, planarity) and for centrality (e.g., closeness, degree, betweenness), as 
well as for flows (max flow, min cut). For dynamic graphs, are there new classes of 
basic algorithms that have no static analogs? What is the complexity of these? For 
instance, determining whether a vertex switches clusters over time (i.e., allegiance 
switching) has no static analog. Another example is detecting the genesis and 
dissipation of communities. 

A related issue is probabilistic graphs. How do we apply graph algorithms to 
probabilistic graphs, where the edges exist with some temporal probability? 

16.3 Detection theory

The goal of many graph analysis techniques is to find items of interest in a graph. 
Historically, many of these techniques are based on heuristics about topological 
features of the graph that should be indicative of what is being sought. Massive 
graphs will need to be analyzed using statistical techniques. Statistical
approaches and detection theory are used in other domains and should
be explored for application to large graphs. For example, how do spectral techniques
applied to sparse matrices apply to large graphs? Can detection be enhanced with 
spectral approaches? Spectral approaches applied to dynamic graphs provide a 
tensor perspective similar to those used in many signal processing applications that 
have two spatial and one temporal dimension. 

A key element of statistical detection is a mathematical description of the 
items of interest. The signature of an item may well be another graph that must 
be embedded in a larger graph while preserving some distance metric and doing so 
in a computationally tractable way. The optimal mapping (matching or detection) 
of a subgraph embedded in a graph is an NP-hard problem, but perhaps there 
are approximation approaches that are within reach for expected problem sizes. A 
subproblem that is of relevance to visualization and analysis is that of projecting a 
large graph into a small chosen graph. The inverse problem is constructing a graph 
from its projections. 

The second key element of statistical detection is a mathematical description 
of the background. Many real-world phenomena exhibit random-like graphical
relationships (social networks, etc.). Statistical detection theory relies on an extensive
theory of random matrices. Can these results be extended to graphs? How do we 
construct random-like graphs? What are the basic properties of random-like graphs, 
and how can one derive one graph property from another? How can we efficiently 
generate random graphs with properties that mirror graphs of interest? 

16.4 Algorithm scaling 

As graphs become increasingly large (straining the limits of storage),
algorithm scaling becomes increasingly important. The computational complexities of
feasible algorithms are narrowing. O(1) algorithms remain trivial, and O(N|M)
algorithms are tractable along with O(N|M log(N|M)), providing the constants are
reasonable. Algorithms that were feasible on smaller graphs are increasingly
becoming less feasible on larger ones, namely those that are O(N^2), O(NM), and
O(M^2). Thus, better scaling
algorithms or approximations are required to analyze large graphs.

Similar issues exist in bandwidth, where storage retrieval and parallel commu- 
nication are often larger bottlenecks than single-processor computation. Existing 
algorithms need to be adapted for both parallel and hierarchical storage environ- 
ments. 

16.5 Computer architecture

Graph algorithms present a unique set of challenges for computing architectures. 
These include both software algorithm mapping and hardware challenges. 

Parallel graph algorithms are very difficult to code and optimize. Can parallel 
algorithmic components be organized to allow nonexperts to analyze large graphs 
in diverse and complex ways? What primitives or kernels are critical to supporting 
graph algorithms? More specifically, what should be the "combinatorial BLAS"?
That is, what is a parsimonious but complete set of primitives that (a) are
powerful enough to enable a wide range of combinatorial computations, (b) are simple
enough to hide low-level details of data structures and parallelism, and (c) allow
efficient implementation on a useful range of computer architectures? Semiring
sparse matrix multiplication and related operations may form such a set of
primitives. Other primitives have been suggested, e.g., the visitor-based approach of
Berry/Hendrickson/Lumsdaine. Can these be reduced to a common set, or is there
a good way for them to interoperate?

Parallel computing algorithms generally rely on effective algorithm mappings 
that can minimize the amount of communication required. Do these graphs
contain good separators that can be exploited for partitioning? If so, efficient parallel
algorithm mappings can be found that will work well on conventional parallel com- 
puters. Can efficient mappings be found that can be applied to a wide range of 
graphs? Can the recursive Kronecker structure of a power law graph be exploited 
for parallel mapping? How sensitive is parallel performance to changes in the graph 
(both with additions/removals of edges and as the graph grows in size)? 

From a hardware perspective, graph processing typically involves randomly 
hopping from vertex to vertex in memory, and experiencing bad memory locality. 
Modern architectures emphasize high memory bandwidth but typically have high latency. It is
not clear if the shift to multicore may make this problem better or worse, as typical 
multicore machines only have a single memory subsystem, resulting in multiple 
processors competing for the same memory. In distributed computing and clusters, 
bad memory locality translates into bad processor locality for computations where 
the data (graph) is distributed. Rather than poor performance due to memory 
latency, clusters typically see poor performance on graph algorithms due to high 
communication costs. 

The underlying issue is data locality. Modern architectures are not optimized
for computations with poor data locality, and the problem only looks to get
worse. Casting graph algorithms in an adjacency matrix formulation may help.
The linear algebra community has a broad body of work on distributing matrix
operations for better data locality. Fundamentally, what architectural features
could be added to complex, heterogeneous many-core chips to make them more
suitable for these applications?






Index 



2D and 3D mesh graphs, 39 

adjacency matrix, 5, 13 

Bellman-Ford, 26, 46 
bibliometric, 86 
bipartite clustering, 246 
bipartite graph, 150, 213 
block distribution, 17 
Brandes' algorithm, 69 
breadth-first search, 32 

centrality, 256
    betweenness, 257
    closeness, 256
    degree, 256
    parallel, 259
    stress, 257
clustering, 237 

compressed sparse column (CSC), 305 
compressed sparse row (CSR), 305 
computer architecture, 356 
connected components, 20, 33 
connectivity, 149 
cyclic distribution, 17 

degree distribution, 141, 147 
dendrogram, 238
densification, 142, 150 
detection, 355 

diameter, 141, 151, 219, 232
    small, 141
distributed arrays, 17 



dual-type vertices, 117 
dynamic programming, 23 

edge betweenness centrality, 78 
edge/vertex ratio, 243 
edges, 14 

effective diameter, 141, 151 
eigenvalues, 141, 148, 219, 233 
eigenvectors, 141, 148 
Erdős-Rényi graph, 39, 142
explicit adjacency matrix, 209 
exponential random graphs, 143 

Floyd-Warshall, 53 
fundamental operations, 291 

genetic algorithm, 345 
graph, 13 

graph clustering, 59 
graph component, 163 
graph contraction, 35 
graph fitting, 181 
graph libraries, 30 
graph partitioning, 38 
graph-matrix duality, 30 

hidden Markov model, 118 
HITS, 86 
hop plot, 141 

input/output (I/O) complexity, 288 
instance adjacency matrix, 211 
iso-parametric ratio, 219, 234 




Kronecker graphs, 120, 212
    deterministic, 161
    fast generation, 157
    generation, 143
    interpretation, 155
    real, 161
    stochastic, 152, 161
Kronecker product, 144
    other names, 147

Lincoln Laboratory Mapping and Optimization Environment (LLMOE), 343

Luby's algorithm, 35 

Markov clustering, 68 
Matlab notation, 30 
matrix
    Hadamard product, 16
    Kronecker product, 16
    multiplication, 16
matrix addition, 291 
matrix exponentiation, 25 
matrix graph duality, 30 
matrix matrix multiply, 292 
matrix powers, 25 
matrix vector multiply, 291 
matrix visualization, 245 
maximal independent set, 35 
memory hierarchy, 290 
minimum paths, 26 
minimum spanning tree, 55 
monotype vertices, 116 

network growth, 245 
node correspondence, 143 

ontology, 354 

model, 143 
PageRank, 86 
parallel
    coarse grained, 262
    fine grained, 262
parallel mapping, 344 
parallel partitioning, 255 
path distributions, 118 



peer pressure, 59 
permutation, 220 
power law, 141 
preferential attachment, 142 
Prim, 56

probability of detection (PD), 124 
probability of false alarm (PFA), 124 
pseudoinverse, 90 

R-MAT graph, 39 

random access memory (RAM) complexity, 289
random graph, 39 
row ordered triples, 298 
row-major ordered triples, 302

scaling, 356

schema, 354 

scree plot, 141 

semiring, 14, 32 

shrinking diameter, 142 

SIAM publications, 91

signal to noise (SNR) ratio, 124 

single source shortest path, 46 

small world, 141 

SNR
    hierarchy, 128
social sciences, 254 
sparse, 16
    storage, 16
sparse accumulator (SPA), 299 
sparse matrix
    multiplication, 31
sparse reference, 291 
sparsity, 220, 226 
spherical projection, 246 
stochastic adjacency matrix, 209 

tensor, 86, 87
    factorization, 89
    Frobenius norm, 88
    Hadamard product, 88
    Khatri-Rao product, 88
    Kronecker product, 88
    matricization, 88
    outer product, 88






time evolution, 355 
tree
    adjacency matrix, 117, 121
triangles, 141 

unordered triples, 294 

vertex betweenness centrality, 69 
vertex interpolation, 243 
vertex/edge schema, 117 
vertices, 14 






